OPTIMAL AND HEURISTIC SYNTHESIS OF HIERARCHICAL CLASSIFIERS. (U)
AUG 76  A V KULKARNI  AF-AFOSR-2901-76

unclassified  TR-6M  AFOSR-T||*77-Oa2S  NL 




COMPUTER  SCIENCE 
TECHNICAL  REPORT  SERIES 


UNIVERSITY  OF  MARYLAND 

COLLEGE  PARK,  MARYLAND 
20742 

D D C 



JUL  28  1977 


AIR  FORCE  OFFICE  OF  SCIENTIFIC  RESEARCH  (AFSC) 
NOTICE  OF  TRANSMITTAL  TO  DDC 

This technical report has been reviewed and is
approved for public release IAW AFR 190-12 (7b).
Distribution is unlimited.

A. D. BLOSS

Technical Information Officer


Technical Report TR-469

ENG 73-04099, AFOSR 76-2901

August 1976


OPTIMAL AND HEURISTIC SYNTHESIS
OF HIERARCHICAL CLASSIFIERS

By

Ashok V. Kulkarni


Dissertation submitted to the Faculty of the Graduate School of
the University of Maryland in partial fulfillment of the
requirements for the degree of Doctor of Philosophy, 1976.


This research was supported in part by the Mathematics and Information
Sciences Directorate, Air Force Office of Scientific Research, Air
Force Systems Command, USAF, under grant AFOSR 76-2901, and in part by
the National Science Foundation, Control and Automation Branch,
Engineering Division, under grant ENG 73-04099, to the Laboratory for
Pattern Analysis, Department of Computer Science, University of
Maryland, College Park, Maryland.


ABSTRACT 


Title  of  Thesis:  Optimal  and  Heuristic  Synthesis  of  Hierarchical  Classifiers 

Ashok V. Kulkarni, Doctor of Philosophy, 1976

Thesis directed by: Dr. Laveen Kanal, Professor of Computer Science,
Department of Computer Science

Multistage schemes such as hierarchical classifiers have been found
useful for many multiclass pattern recognition tasks. This dissertation
investigates the theoretical properties of a general model of multistage
multiclass recognition schemes. The generality of the model allows one
to describe a large class of parametric and non-parametric schemes
commonly used in terms of the model parameters. Two classes of admissible
and optimal strategies for obtaining the optimal decision are analyzed.
These strategies employ lower and upper bounds on a risk function to
improve the search efficiency. New methods of computing the bounds are
investigated for the cases when the features are class-conditionally
statistically independent and where they satisfy a first-order tree
dependence relation. Bounds are also derived for use in nearest-neighbor
classification schemes employing a Euclidean distance measure and various
similarity measures for non-metric feature vectors.

Hierarchical classifiers are special types of multistage recognition
schemes wherein at each stage certain classes are rejected from
consideration as labels of the test sample. Theoretical properties of
decision trees whose node decisions are statistically independent are
investigated. Even under this independence assumption the optimal tree
design task is a complex one.

A three phase decomposition of the tree design problem is proposed,
viz. tree skeleton design, feature selection at its nodes and decision
function design at each node. Optimal solutions to each design phase are
obtained using a dynamic programming formulation.

These optimal design methods rapidly become cumbersome in computational
resources as the number of features and classes increases. This study
proposes various techniques for reducing the computational complexity
incurred in finding the optimal features to be measured at each node and
the optimal decision policy. A method of clustering decision rules and
rejecting sets of suboptimal rules without evaluating each individual one
is proposed. Feature ranking and a branch-and-bound method are described
for reducing the possible feature assignments to be considered in finding
the optimal feature measurement policy.

In practice, the decision rules at the nodes have to be estimated
from a finite set of design samples. This work investigates the
relationship between the expected tree performance, sample size and the
number of states (quantization levels) of each feature. It is shown that
for an M-class recognition scheme using a decision tree, there exists an
optimal quantization complexity. The optimum complexity increases with
sample size and with the number of classes to be distinguished. For small
sample sizes, it is shown that a multistage decision scheme can have a
lower error rate than a single stage scheme which uses all the available
measurements in


ACKNOWLEDGEMENT 


I wish to express my gratitude to Professor Laveen Kanal
who inspired this research and guided and encouraged me during
its development. My thanks also go to Dr. Ashok Agrawala for
his various suggestions for improving the presentation of the
material, and the many useful discussions we have had during
these past few years. Finally, my heartfelt thanks go to my
wife Ranjana, who encouraged me along the way.




Table of Contents

1. Introduction  1
1.1. A System Configuration  2
1.2. The Design Tradeoffs  3
1.2.1. Feature Selection  7
1.3. Literature On Multistage Classification  10
1.4. Scope Of This Research  16

2. General Model of Multistage Classification  22
2.1. State-Space Model  23
2.2. S-admissible Strategies  25
2.2.1. Algorithm S  26
2.2.2. k-Step Lookahead Heuristic  29
2.3. B-admissible Strategies  31
2.3.1. Algorithm B  32
2.3.2. Bayes Optimal Strategy  35
2.3.3. Graphical Representation of B-admissible Search  36
2.4. Methods Of Computing Bounding Functions  38
2.4.1. Statistically Independent Features  39
2.4.2. Tree Dependent Features  44
2.4.2.1. Computation Of Minimal/Maximal Spanning Tree  49
2.4.3. Nearest Neighbour Classification  54
2.4.3.1. The Model  55
2.4.3.2. Euclidean Measure  57
2.4.3.3. Ultrametric Measure  59
2.4.4. Bounds on Similarity Measures For Binary Vectors  60
2.4.4.1. Bounds, Sl and Su  61
2.5. Hierarchical Classifiers  65
2.6. Hart's Probabilistic Decision Tree Model  68

3. Properties Of Hierarchical Classifiers  71
3.1. Notation  73
3.2. Performance Of A Decision Tree  74
3.2.1. Probability Of Correct Recognition  74
3.2.2. Other Measures Of Tree Performance  75
3.3. Error in Assuming Sum of Products Form of Pc(T)  76
3.4. A Property Of The Optimal Decision Policy  81
3.4.1. A Bound On The Tree Performance  85
3.5. A Property Of The Optimal Feature Assignment  90

4. A Phased Approach To Optimal Tree Design  94
4. Computational Complexity  95
4.1. Tree Design Using Dynamic Programming  96
4.1.1. Optimal Decision Policy Given The Tree  97
4.1.2. Optimal Feature Ordering and Decision Policy  100
4.1.3. Optimal Tree Structure  102
4.1.3.1. The Additive Cost Assumption  104

5. Methods Of Reducing The Computational Complexity  108
5.1. Decision Policy Design Given The Tree  109
5.1.1. Optimal One-Step Decision Policy  115
5.1.2. Clustering in Decision Space  117
5.1.2.1. Similarity Measure For Decision Vectors  119
5.1.2.2. Clustering Vectors in M-space  122
5.1.3. Efficient Decision Strategies  123
5.2. Feature Selection at Tree Nodes  133
5.2.1. A Dynamic Programming Formulation  134
5.2.2. A Branch and Bound Formulation  136
5.2.3. Feature Ranking  141

6. Estimation Of Decision Rules From Finite Samples  142
6.1. Estimation Of Discrete Probabilities Of Mixtures  145
6.2. Mean Accuracy Of A Hierarchical Classifier  149
6.3. Hierarchical Classifier Versus One-Step Classifier  162

7. Conclusions and Directions for Further Research  167

8. References  171


LIST  OF  TABLES 


Table  Page

1. Variation of Pc(t) with node performance (Graph 1)  89

2. Variation of mean accuracy as a function of sample size,
   quantization complexity and the number of classes (Graph 2)  161

3. Error rate versus sample size for a one-step classifier
   and a decision tree (Graph 3)  166


1. Introduction

In pattern recognition practice, decision trees have
been extensively used for multiclass classification tasks.
However, theoretical developments in pattern recognition
have essentially avoided addressing the problem of designing
such hierarchical classifiers. In practice, various
heuristic and ad hoc methods are employed. This
dissertation investigates the theoretical properties of a
general model of multistage multiclass recognition schemes.
The generality of the model allows one to describe a large
class of parametric and non-parametric schemes proposed in
the literature in terms of the model parameters.
Hierarchical classifiers comprise an important special case
of this model. We derive new theoretical properties of such
classifiers and investigate the use of optimization methods
for their design.

This chapter presents the background for the problem and
the motivation which led to the investigation reported in
subsequent chapters, and outlines the scope of this
dissertation. The first section describes the components of
a statistical pattern recognition system and the various
design phases that guide its development. Section 2 outlines
the trade-offs between three factors, viz., classifier
performance, complexity, and measurement cost, that occur in
the design of any practical system. The various multistage
schemes proposed in the literature for achieving this
trade-off are reviewed in section 3. The last section
summarizes the contributions of this dissertation and its
relationship to the existing literature on multistage
multiclass recognition.



[Fig. 1.1: Flow of information during the design cycle. Legend: flow of
data in operational system; design influences.]


1.1. A System Configuration

Most statistical pattern recognition systems can be
divided into four phases (Fig. 1.1) through which an
observed sample passes before it is classified [8]. These are
(i) a measurement phase, (ii) a feature extraction stage,
(iii) a phase during which a set of discriminant functions
is applied to the feature vector extracted, and (iv) a
decision logic step that uses the outcomes of the functions
to classify the sample.

The raw measurements consist of a potentially large set
of measurements that can be taken from a sample. Their choice
is guided by problem knowledge, which suggests measurements of
interest, and by the state of the art of transducer technology,
which dictates what measurements are feasible. Given this
measurement set, certain features may be synthesized on the
basis of prior information (e.g., features such as the area
and perimeter of a nucleus of a white blood cell), while
other features may be extracted from data analysis
experiments conducted in an interactive mode. Linear and
nonlinear mappings to one-space or two-space, and
discriminant vector projections, fall into this latter
category. During this data analysis phase, certain
important information, such as the modes of a distribution
(obtained via clustering experiments, for example), may be
gathered and passed on to the subsequent feature selection
phase.

Having  extracted  a set  of  N features,  the  feature 
selection  task  consists  of  finding  the  "best"  subset  of  n 
out  of  N features  which  will  optimize  the  classifier 
performance. Since the goodness of a set of features depends
on  the  form  of  the  classifier  in  which  they  will  be  used, 
the  feature  selection  and  classifier  design  tasks  are 
strongly  coupled. 

The  classifier  design  task  consists  of  estimating 
parameters  of  the  discriminant  functions  and  designing  the 
decision  logic  to  optimize  the  classifier  performance  in  the 
field.  The  entire  design  cycle  may  have  available  to  it  a 
set  of  labelled  or  unlabelled  samples  from  which  to  glean 
the  information  it  needs  in  each  phase. 

The  implementation  of  such  a system  involves  several 
iterations through the design cycle since design choices
made at one stage affect all subsequent stages. While the
feature  extraction  phase  is  problem  dependent,  the  feature 
selection  and  classifier  design  steps  can  be  automated. 

1.2. The Design Tradeoffs

The design procedure of a practical pattern recognition
scheme must contend, in general, with three counteracting
factors, viz., the performance of the classifier, the
complexity of the design, and the cost of taking feature
measurements. The components of these factors contributing
to the design problem are depicted as a tree diagram in
Fig. 1.2 below.

[Fig. 1.2: Tree diagram of the design factors. Recoverable labels:
Factors; Measurement cost; Measurement Complexity; Decision Complexity.]


The performance index of a classifier is usually taken
to be the average misclassification rate, or average loss.
If X represents a random sample (set of feature
measurements) having a class conditional probability
function p(X/wi), d(X) is a decision rule that classifies X
into one of M classes, w1,...,wM, and C(wi,wj) denotes the
cost (loss) incurred in classifying a wi sample into wj,
then the average loss is given by

$$R = \sum_{i=1}^{M} P(w_i) \int C(w_i, d(X))\, p(X/w_i)\, dX \qquad (1.1)$$

where P(wi) is the a priori class probability of class
wi.
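
As a rough illustration of the average loss in equation (1.1), the Python sketch below treats the case of discrete features, where the integral over X becomes a sum over the possible sample values. The priors, class-conditional probabilities, 0-1 cost function and decision rule are made-up values chosen only for the example; none of them come from the text.

```python
def average_loss(priors, cond_probs, cost, decide):
    """Average loss R = sum_i P(w_i) * sum_X C(w_i, d(X)) * p(X/w_i).

    priors[i]        -- a priori probability P(w_i)
    cond_probs[i][x] -- p(X = x / w_i) for a discrete sample value x
    cost(i, j)       -- loss for classifying a w_i sample into w_j
    decide(x)        -- decision rule d(X), returns a class index
    """
    total = 0.0
    for i, prior in enumerate(priors):
        for x, p in cond_probs[i].items():
            total += prior * cost(i, decide(x)) * p
    return total


# Hypothetical two-class problem with a single binary feature.
priors = [0.5, 0.5]
cond_probs = [{0: 0.8, 1: 0.2},   # p(X / w_1)
              {0: 0.3, 1: 0.7}]   # p(X / w_2)


def zero_one(i, j):
    # 0-1 cost: no loss for a correct label, unit loss otherwise.
    return 0.0 if i == j else 1.0


def rule(x):
    # Decide class index 0 when X = 0 and class index 1 when X = 1.
    return x


risk = average_loss(priors, cond_probs, zero_one, rule)
```

With a 0-1 cost the average loss reduces to the misclassification rate of the rule.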

In equation (1.1), X refers to a vector of features to
be measured. There is a cost associated with taking new
measurements, such as the cost of added instrumentation or
computational resources. Sometimes this cost may be an
intangible quantity. In medical diagnosis, for example,
taking certain measurements may affect the patient's health,
or it may take several days to be performed, during which
time the patient might remain hospitalized. This cost,
referred to as measurement cost, is incurred whenever the
decision rule, d(), requires the particular feature
measurement. It is assumed in this discussion that
classifier performance and costs, whether real or
intangible, can be expressed in some common quantitative
terms.

The  phrase  'complexity  of  the  classifier',  is  used  here 
as  a general  term  to  describe  two  factors,  (1)  the 
measurement  complexity,  which  depends  on  the  number  of 
features  used  and  the  possible  states  (values)  that  these 
can  assume,  or  parameters  needed  to  describe  their 
distributions,  and  (2)  the  decision  complexity,  which 
depends  on  the  form  of  the  decision  rule,  d(X),  i.e.  the 
parameters  needed  to  describe  it.  For  a parametric 
classification  scheme,  the  measurement  complexity  controls 
the number of parameters to be estimated, e.g. if N
statistically independent features are used, and each
feature is discrete, taking on one of m values, the total
number of probabilities to be estimated per class is N·m. For a
given  desired  error  rate,  and  a fixed  feature  set,  the 
minimum  decision  complexity  is  a function  of  the  'ease'  of 
separability  of  the  classes  in  that  feature  space.  In  other 
words,  if  a sample  vector,  X,  is  regarded  as  a point  in 
feature  space,  and  if  d(X)  is  a surface  that  partitions  this 
space  into  disjoint  regions,  each  having  a class  label, (i.e. 
the  decision  made  if  a sample  lies  in  that  region),  then  for 
a given  error  rate,  e,  the  minimum  number  of  parameters 
needed  to  specify  the  surface,  d(X),  would  be  larger  if  the 
classes  were  highly  overlapped,  and  small  if  they  were 
perfectly  separable,  or  overlapping  only  slightly. 
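
To make the measurement-complexity count above concrete, the following Python sketch tabulates the N·m probabilities per class quoted in the text for statistically independent features. For comparison it also counts the cells of a full joint table over the same features; that comparison is an added assumption for illustration and does not appear in the text.

```python
def independent_param_count(n_features, m_values):
    """Probabilities to estimate per class when the N features are
    statistically independent: one m-valued table per feature (N * m)."""
    return n_features * m_values


def joint_param_count(n_features, m_values):
    """Cells in a full joint table over all N features (m ** N).
    Included only for comparison; this figure is not taken from the text."""
    return m_values ** n_features


# Hypothetical setting: each feature takes one of m = 4 values.
counts = [(n, independent_param_count(n, 4), joint_param_count(n, 4))
          for n in (2, 5, 10)]
```

The gap between the two columns grows quickly with N, which is one way to see why the number of parameters matters when only a finite set of design samples is available.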

It has been shown [17,27] that the error rate can be
made arbitrarily close to zero if one had an infinite supply
of features, where each feature contributed to the class
discrimination. In practice, however, a zero error rate is
often unachievable. First, the set of potential
discriminatory features, though large, cannot be infinite,
since in most applications these are band-limited
observations [17].

Secondly, in most cases one has to estimate the
parameters of the assumed underlying distribution using a
finite set of samples. It has been shown [17] that there
exists a relationship between the optimum measurement
complexity and the sample size, in that if any more or fewer
than this optimum number of features are used in a
classifier, its performance would be worse than the optimal
value.

For a given set of features, the Bayes rule minimizes
the average risk. However, for most problems of interest,
the Bayes decision surface is too complex and would require
excessive storage and computational cost. Hence, to reduce
the classifier's design and running cost, this surface is
often approximated by a simpler surface, e.g. a set of
piecewise linear hyperplanes might be used to approximate a
polynomial surface. The error rate using this approximation
is larger than the Bayes error rate.

Finally, the cost of taking a particular measurement at
a certain stage of the decision process might not be
justified in view of the small improvement in performance it
would give. Hence, this cost also restricts the performance
of a classifier for certain applications.

Hence, measurement cost, decision complexity, and the
relationship between measurement complexity and sample size
place an upper bound on the performance of a practical
classification scheme. The interaction between these factors
gives rise to the feature selection problem, since one is
forced to limit the number of features to be used in the
classifier. The trade-off between performance and complexity
gives rise to the classifier and decision rule design
problems. These problems are accentuated for multiclass and
multimodal classification tasks. We shall briefly describe
the various feature selection methods proposed in the
literature, and point out some of their limitations when
applied to a multiclass discrimination problem.

1.2.1.  Feature  Selection 

Feature selection, as usually discussed in the
literature, is the problem of finding the best set of n out
of a total of N available features that will optimise the
performance of a classifier. This performance is assumed to
be the Bayes error rate. Where this is difficult to compute,
various loose bounds on this error are used. The best set
of n features can always be found by examining all
$\binom{N}{n}$ combinations. However, for large N one may
want a more efficient search strategy that finds a good
though not necessarily optimal subset of features. The
search for the optimal set is complicated by the fact that
the best k out of N features need not be contained in the
best k+1 out of N features, even if the features are
statistically independent [35]. The different feature
selection methods differ in the criterion function sought to
be optimized, and the method of search.
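
The exhaustive search mentioned above can be sketched in a few lines of Python. The criterion used here is a hypothetical toy function (individual merits plus a bonus when two particular features are used together), chosen only so that the best single feature is not contained in the best pair; it is not a criterion from the literature cited in the text.

```python
from itertools import combinations
from math import comb


def best_subset(n_total, n_select, goodness):
    """Evaluate every subset of n_select features and return the best one."""
    return max(combinations(range(n_total), n_select), key=goodness)


# Hypothetical individual merits of four candidate features.
MERIT = [0.6, 0.5, 0.5, 0.1]


def toy_goodness(subset):
    # Toy criterion: sum of merits, plus a bonus when features 1 and 2
    # appear together (an assumed complementary pair).
    bonus = 0.3 if {1, 2} <= set(subset) else 0.0
    return sum(MERIT[f] for f in subset) + bonus


n_candidates = comb(4, 2)                   # subsets examined for n = 2
best_one = best_subset(4, 1, toy_goodness)  # best single feature
best_two = best_subset(4, 2, toy_goodness)  # best pair of features
```

In this toy setting the best single feature is feature 0, while the best pair is features 1 and 2, so the best subsets of successive sizes are not nested.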

Let G denote the function that evaluates the feature set
goodness. Two types of G functions have been proposed. The
first type are various distance and information measures
which are loosely related to the Bayes error. A list of such
measures appears in [8]. All these functions involve the
class-density functions which may have to be estimated. The
second category of G functions may be defined as geometric
measures. Some of these are described in [21]. Functions of
between-class and within-class scatters fall in this
category.

The search methods may be divided into four groups, (i)
single step selection, (ii) without-replacement method,
(iii) iterative refinement method, and (iv) the recursive
method.

In the single-step method, the n features are chosen in
one step, e.g. the n features which contribute the largest
amounts to the eigenvectors corresponding to the m largest
eigenvalues of the covariance matrix [21]. Other methods are
contained in [29,30].

The without-replacement policy consists of adding to the
subset of k features already chosen that feature which
maximizes the criterion function, G. Here G could be a
goodness measure of the single feature, or a measure of the
correlation of that feature with the k features, or a
weighted sum of the two, etc. [20]. In multiclass schemes,
certain modifications of these methods have been suggested
[20], e.g. the (k+1)th feature may be chosen as that which
best separates the class pair most confused by the k feature
set.
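
A minimal Python sketch of the without-replacement policy follows, assuming that G is evaluated on the enlarged subset at each step (one of several choices of G that the text allows). The merits and the pair bonus in the toy criterion are hypothetical values used only to show that the greedy result can fall short of the best subset.

```python
def forward_select(n_total, n_select, goodness):
    """Greedy without-replacement search: grow the subset one feature at a
    time, each time adding the feature that maximizes G of the enlarged set."""
    chosen = []
    remaining = set(range(n_total))
    while len(chosen) < n_select and remaining:
        # Ties are broken in favour of the lowest feature index.
        best = max(sorted(remaining), key=lambda f: goodness(chosen + [f]))
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Hypothetical individual merits of four candidate features.
MERIT = [0.6, 0.5, 0.5, 0.1]


def toy_goodness(subset):
    # Toy criterion: sum of merits, plus a bonus when features 1 and 2
    # appear together (an assumed complementary pair).
    bonus = 0.3 if {1, 2} <= set(subset) else 0.0
    return sum(MERIT[f] for f in subset) + bonus


greedy_pair = forward_select(4, 2, toy_goodness)
```

Here the greedy search commits to feature 0 first and ends with features 0 and 1, whose toy G value is lower than that of the pair 1 and 2.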

The iterative refinement policy consists of replacing
some m out of the k features chosen at the kth step by
some other features, until no improvement in G is possible.
Then the (k+1)th feature which causes the largest increase
in G is added to this refined set, and the search proceeds
to the (k+1)th stage.

The recursive method, also called a dynamic programming
method [3], may be described in the following manner. At the
kth step we have found the 'best' k out of N features. To
find the best k+1 features, evaluate all k+1 feature sets
obtained in the following two ways: (i) the best k feature
set found along with any feature from the remaining feature
set (without-replacement policy), and (ii) any set of m out
of the k feature set together with the best set of (k+1-m)
features out of the remaining N-k features, the latter being
defined recursively in a similar manner. The best k+1
features out of N are the features evaluated in (i) and (ii)
with the largest G value.

The above methods of selecting features have some
drawbacks when used in a multiclass recognition system. Most
of the geometric measures of goodness are inadequate if more
than two classes have to be distinguished, or if the classes
are multimodal. For multiclass and multimodal problems, a
single geometric measure usually cannot describe the
goodness of a feature set. By using a single criterion
function, G, it is being assumed that a Bayes classifier
will be used to discriminate the classes. However, from the
earlier discussion, it is seen that a Bayes classifier may
be too difficult and complex to implement. Thus, these
feature selection methods do not connect the selection task
with that of specifying the form of the classifier in which
they would be used. Another factor to be considered is that
of optimum dimensionality. For a given sample size per
class, one may not want to use more than, say, n features.
However, the best n features to discriminate one class pair
may be different from the best set to distinguish between
another pair. By assuming that some n features are used in a
single step to make an M-way decision (for an M-class
problem), one is forced to compromise the feature set
choice. A better use of the features might be to split the
decision into several stages, where at any stage one set of
classes is distinguished from another. Thus, at each step,
one could use the n features best suited for that task.
Another limitation of the above search methods is that they
do not take into account the measurement cost-performance
tradeoff that has to be made in many applications.

Multistage schemes have been proposed in the literature
to achieve the performance-complexity-cost tradeoffs that
must be made in practice. In such schemes, a sequence of
feature measurements are taken from the sample, and a
sequence of 'guesses' made regarding the sample label. The
various schemes proposed differ in the strategy used to
select the next feature to be observed, and the decision
function used to refine the 'guess' of the label. By breaking
the multiclass decision making problem into a series of less
complex decisions, the complexity in the design of each such
decision is reduced. The feature to be measured at any
stage can be chosen depending on the classes to be
distinguished. Moreover, if the cost of taking the next
measurement is unjustifiable in view of the gain in
accuracy, the measurement process can be terminated, and a
'good' decision made on the basis of observations taken.

1.3. Literature On Multistage Classification

Multistage decision schemes and hierarchical classifiers
have been extensively used to solve recognition problems and
experience with such methods is discussed in several
papers. This body of work can be grouped into three broad
categories:

(1) Conversion of decision tables to optimal
decision trees,

(2) Sequential pattern classification methods, and

(3) Hierarchical classification methods.

There has been a considerable amount of work reported in
converting decision tables into optimal decision trees
[V,15,24,25,26]. In these cases, the criterion of optimality
is the average number of nodes traversed to classify a
sample, or the total number of nodes in the tree [25].
However, these table conversion methods do not address the
problem of trading efficiency (as measured by the average
number of nodes traversed) for misclassification cost, since
they assume that the patterns can be unambiguously
distinguished. Knuth [9] considers a dynamic programming
approach to construct an optimal binary search tree for
alphabetical key words, where at each node one stores a key
word, and the test at that point consists of seeing if the
sample key matches that key, is less than that key or
greater than it. Meisel and Michalopoulos [10] discuss the
use of a recursive procedure to arrange a set of piecewise
constant boundaries in metric space, so as to minimize the
average number of comparisons needed to classify a sample.
However, they assume that the feature space has already been
broken up into the various decision regions. Thus, the
algorithm rearranges the order of the tests optimally,
without affecting the misclassification rate.

Stoeffel [1S] and Bell [15] use a two phase method for
designing a hierarchical classifier for character
recognition applications. The basic features are binary
(e.g. presence or absence of some structural component). In
the first phase, these features are used to synthesize more
complex features which are basically binary vectors with
'don't care' bits, wherein each bit represents the presence or
absence of a given property. This collection of vectors
(called prototype vectors) is used to generate a decision
table for the multicharacter recognition problem. In the
second phase, the decision table is converted into a
decision tree having a minimum path length. Bell [15] gives a
heuristic algorithm for deriving the prototypes (which he
refers to as decision rules). The 'decision rules' for each
class are generated automatically using design samples. The
procedure starts with the most general rule with all don't
care bits, i.e. one that accepts all samples. Then the bits
are sequentially specified, and the merit of a rule is
measured by

$$T \cdot p - U \cdot q$$

where p, q are constants determined empirically and
varied during the procedure in a certain way, T is the
number of samples from the target class which the rule
accepts, and U is the number of non-target samples accepted
by it.
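
The merit measure described above can be illustrated with a short Python sketch. The rule encoding (a string over '0', '1' and 'x' for a don't-care bit), the design samples, and the values of p and q are all hypothetical choices made for the example; the text does not specify how the constants are varied during the procedure.

```python
def accepts(rule, sample):
    """A rule is a string over '0', '1' and 'x' (don't care); it accepts a
    binary sample when every specified bit matches the sample."""
    return all(r == 'x' or r == s for r, s in zip(rule, sample))


def merit(rule, target, non_target, p, q):
    """Merit T*p - U*q, where T counts accepted target-class samples and
    U counts accepted non-target samples."""
    t = sum(accepts(rule, s) for s in target)
    u = sum(accepts(rule, s) for s in non_target)
    return t * p - u * q


# Hypothetical design samples: 3-bit feature vectors.
target = ["110", "111", "100"]
non_target = ["010", "011", "101"]

general = merit("xxx", target, non_target, p=1.0, q=2.0)   # accepts everything
refined = merit("1xx", target, non_target, p=1.0, q=2.0)   # first bit specified
sharper = merit("11x", target, non_target, p=1.0, q=2.0)   # two bits specified
```

Specifying bits trades accepted target samples (T) against rejected non-target samples (U), and the constants p and q set how that trade is scored.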

One  of  the  early  investigations  of  sequential  methods 
for  two  class  problems  was  carried  out  by  Wald  C?83.  This 
method  assumes  that  an  infinite  series  of  measurements  can 
be  taken  sequentially  from  the  test  sample*  After  taking 
each  measurement,  the  ratio  of  aposteriori  likelihoods  of 
the  two  classes  is  compared  with  two  threshold  values.  If 
the  ratio  falls  between  the  two,  the  measurement  process  is 
continued,  while  if  it  falls  to  one  side  or  the  other  of 
both  thresholds,  the  sample  is  classified  into  one  or  the 
other class. A generalization of this method for the
multiclass case was presented in [15]. It can be shown that
for  the  two  class  case,  the  Wald  sequential  ratio  test 
requires  the  minimum  average  number  of  measurements  for  a 
given  error  rate  per  class. 
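The two-threshold test described above can be sketched as follows. This is a minimal illustration, assuming independent measurements and equal priors so that the accumulated log likelihood ratio can stand in for the ratio of a posteriori likelihoods. The function names, the Bernoulli measurement model and the error rates are assumptions chosen for the example; the thresholds are Wald's usual approximations.

```python
import math


def sprt(observations, loglik_ratio, err1, err2):
    """Sequential ratio test for two classes.

    loglik_ratio(x) returns log p(x | class 1) - log p(x | class 2).
    err1 = P(decide 2 | class 1), err2 = P(decide 1 | class 2).
    Returns (decision, measurements used); decision is None if the data ran out.
    """
    upper = math.log((1 - err1) / err2)   # accept class 1 at or above this
    lower = math.log(err1 / (1 - err2))   # accept class 2 at or below this
    total = 0.0
    for k, x in enumerate(observations, start=1):
        total += loglik_ratio(x)
        if total >= upper:
            return 1, k
        if total <= lower:
            return 2, k
    return None, len(observations)        # still between the two thresholds


def llr(x):
    # Hypothetical model: P(x=1 | class 1) = 0.8 and P(x=1 | class 2) = 0.2,
    # so the likelihood ratio is 4 for x = 1 and 1/4 for x = 0.
    return math.log(4.0) if x == 1 else -math.log(4.0)


print(sprt([1, 1, 1, 1], llr, 0.05, 0.05))
print(sprt([0, 1, 0, 0, 0], llr, 0.05, 0.05))
```

The number of measurements used varies from sample to sample, which is the property that the minimum average measurement count refers to.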

Fu [15] discusses the problem of optimal sequential
decision making and feature ordering when the number of
features is finite. He uses a dynamic programming
formulation to compute the optimal policy for a fixed
feature ordering, and the optimal feature ordering and
decision policy, for a given number of features, N. This
method consists of starting at the last stage of the
decision making process and evaluating the optimal policy
for all possible histories of measurements. The decision at
that point is that of either classifying the sample into one
of M classes, or taking the next measurement. This choice is
based on the minimum of the risk that would be incurred if
the classification were made, and the cost incurred if the
measurement process were continued. Thus, the decisions
computed at each stage optimize the weighted sum of risk and
measurement cost for any sample that reaches that stage.
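The stop-or-continue choice described above can be sketched with a memoized recursion, which evaluates the same quantities as a backward pass from the last stage. This is a simplified illustration and not the cited formulation: it assumes binary features that are conditionally independent given the class, a fixed feature ordering, a 0-1 loss and a constant cost per measurement. All names are hypothetical.

```python
from functools import lru_cache


def optimal_sequential_policy(priors, p_one, cost):
    """priors[i] = P(class i); p_one[i][k] = P(feature k = 1 | class i).

    Features are measured in the fixed order 0..N-1, each at price `cost`.
    Returns rho(history) -> (minimum expected risk plus cost, best action).
    """
    n_features = len(p_one[0])

    def joint(history):
        # joint[i] = P(class i, history) under conditional independence.
        out = []
        for prior, probs in zip(priors, p_one):
            w = prior
            for k, x in enumerate(history):
                w *= probs[k] if x == 1 else 1 - probs[k]
            out.append(w)
        return out

    @lru_cache(maxsize=None)
    def rho(history):
        w = joint(history)
        total = sum(w)
        stop_risk = 1 - max(w) / total          # risk of classifying now
        if len(history) == n_features:
            return stop_risk, "classify"        # last stage: must classify
        cont = cost                             # cost of one more measurement
        for x in (0, 1):
            nxt = history + (x,)
            p_next = sum(joint(nxt)) / total    # P(next value | history)
            if p_next > 0:
                cont += p_next * rho(nxt)[0]
        if stop_risk <= cont:
            return stop_risk, "classify"
        return cont, "measure"

    return rho


rho = optimal_sequential_policy([0.5, 0.5], [[0.9], [0.1]], cost=0.1)
print(rho(()))   # decision before any measurement is taken
```

With a cheap measurement the policy prefers to measure. Raising the cost above the risk reduction it buys makes immediate classification the better choice.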

Hart [15] considers the problem of finding an admissible
strategy (Bayes optimal) for a probabilistic decision
tree. The tree is binary, and each terminal node represents a
joint hypothesis comprised of the subhypotheses denoted by
the arcs on the path to that terminal. Thus each measurement
provides an a posteriori likelihood of each of a pair of
subhypotheses. Under the assumption that the likelihood of a
composite hypothesis depends only on the measurements taken
along that path, and the assumptions that the measurements
are conditionally independent and independent, Hart shows
that the a posteriori likelihood of the joint hypotheses
below a node can be used as a heuristic to order the nodes
which are candidates for traversal. The resulting algorithm
is admissible and optimal.

Fukunaga [6] has proposed the use of branch-and-bound
methods for reducing the distance computations in
nearest-neighbour schemes. His method consists of making a
hierarchical partitioning of the design sample set and
storing at each node of the hierarchy the mean vector and
radius of the samples represented by that node. These two
parameters are used to put a lower bound on the minimum
distance between a test sample and any sample in the design
set at the node. If the current minimum distance of the
test  sample  from  any  design  sample  is  less  than  this  lower 
bound,  the  entire  set  of  design  samples  at  that  node  can  be 
discarded  from  consideration  as  nearest  neighbours.  In  this 
way,  the  nearest  neighbour  can  be  found  with  a smaller 
average  number  of  distance  computations. 
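The pruning rule can be sketched for a single level of the hierarchy. By the triangle inequality, no sample in a node can be closer to the test sample than the distance to the node mean minus the node radius. The function name, the one-level partition and the sample data below are assumptions for illustration; the cited method uses a multi-level hierarchy.

```python
import math


def nearest_neighbour(test, clusters):
    """clusters: list of lists of design samples (one level of the hierarchy).

    Returns (nearest sample, its distance, number of sample distances computed).
    """
    # Store at each node the mean vector and the radius of its samples.
    nodes = []
    for pts in clusters:
        mean = tuple(sum(c) / len(pts) for c in zip(*pts))
        radius = max(math.dist(mean, p) for p in pts)
        nodes.append((mean, radius, pts))

    best, best_d, computed = None, math.inf, 0
    # Visit nodes with the closest means first, so a small distance is found early.
    for mean, radius, pts in sorted(nodes, key=lambda nd: math.dist(test, nd[0])):
        lower_bound = math.dist(test, mean) - radius
        if best_d < lower_bound:
            continue                     # every sample at this node is discarded
        for p in pts:
            computed += 1
            d = math.dist(test, p)
            if d < best_d:
                best, best_d = p, d
    return best, best_d, computed


clusters = [
    [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)],
    [(10.0, 10.0), (11.0, 10.0), (10.0, 11.0)],
]
print(nearest_neighbour((0.2, 0.1), clusters))
```

In this example the second cluster is discarded by the bound, so only three of the six sample distances are computed.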


The methods proposed in the literature for designing
hierarchical classifiers which minimize the sum of
measurement cost and risk are mostly top down heuristic
methods which are suboptimal solutions to the problem
[4, 14, 16]. Mattson and Dammann [16] consider the use of linear
discriminants to detect and code the clusters in a
multiclass problem, wherein the node decisions are binary
and the thresholds are set manually by inspecting the
scatter of the samples (labelled) along that axis (the Fisher
direction). Friedman [4] describes a nonparametric
classification method that consists of splitting a set of
labelled design samples into successively smaller sets,
until all samples in each of the terminal sets belong to the
same class, or have a membership less than some specified
constant. The feature chosen at each node to split the
samples is that which has the largest value of the
Kolmogorov variational distance between the two class
populations projected along that feature axis, or along the
Fisher direction using all features. The threshold at that
node is selected to maximize this variational distance. For
multiclass schemes, two methods are suggested. The first
uses M such trees where the i th. tree separates the i th.
class from the other classes. The test sample is classified
by passing it down all M trees and labelling it with the
class which has the largest number of samples in the
'buckets' (terminal nodes) into which the test sample falls.
The second scheme uses a generalization of the Kolmogorov
distance and a single tree to make the classification. The
generalized distance is the variance of the M estimated
cumulative distribution functions along the particular
feature axis. The feature which has the largest variance
value is the feature chosen to split the sample set.
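For the two-class case, the choice of splitting feature and threshold can be sketched as below. The sketch assumes that the distance along a feature axis is measured as the largest absolute difference between the two empirical cumulative distributions, and that candidate thresholds are the observed values. The function names and the data are hypothetical, and the Fisher-direction variant is not shown.

```python
def best_split(values_a, values_b):
    """Return (threshold, distance) maximizing |F_a(t) - F_b(t)| on one feature.

    F_a and F_b are the empirical cumulative distributions of the two classes.
    """
    best_t, best_d = None, -1.0
    for t in sorted(set(values_a) | set(values_b)):
        fa = sum(v <= t for v in values_a) / len(values_a)
        fb = sum(v <= t for v in values_b) / len(values_b)
        if abs(fa - fb) > best_d:
            best_t, best_d = t, abs(fa - fb)
    return best_t, best_d


def choose_feature(class_a, class_b):
    """class_a, class_b: lists of feature vectors. Returns (feature, threshold, distance)."""
    scores = []
    for k in range(len(class_a[0])):
        t, d = best_split([x[k] for x in class_a], [x[k] for x in class_b])
        scores.append((d, k, t))
    d, k, t = max(scores)            # feature with the largest distance
    return k, t, d


class_a = [(1, 5), (2, 1), (3, 4)]
class_b = [(7, 2), (8, 5), (9, 3)]
print(choose_feature(class_a, class_b))
```

Here the first feature separates the two classes completely at threshold 3, so it is chosen over the second feature, whose distributions overlap.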

Wu [14] describes a top down heuristic method for the
design of a hierarchical classifier. Starting with the design
samples from all the classes, the procedure successively
partitions the samples (classes) using a non-supervised
clustering algorithm, and evaluating all 'likely' feature
subsets. The feature subset chosen for the node, di, is that
having the smallest value of the evaluation function, which
is of the form,

$E(d_i) = -T(d_i) - K \, e(d_i) + \sum_{j=1}^{c_i} E(d_{i+j})$

In the above equation E is the evaluation function,
e(di) is the error rate at di using that feature set, T(di)
is a measure of the time needed to make the decision at di,
K is a weighting term, and ci is the number of descendant
nodes of di, obtained via the clustering procedure. E(di+j)
are the evaluation functions of the descendant nodes, and
since these nodes have not yet been expanded, E(di+j) is
assumed to be a sum of two terms, viz.,

$E(d_{i+j}) = -T(m, n_j) - K \, C$

where T(m,nj) is the computation time at that node,
assumed to be a function of the total set of features, m,
and the number of classes, nj, in that cluster. The error at
di+j is assumed to be a constant, C. Thus, this top down
procedure of tree generation consists of splitting using the
feature set with the smallest value of the evaluation
function E(di), and then repeating the process on each of
the descendant nodes, until each terminal set contains
samples from a single class.


1.4.  Scope  Of  This  Research 

The  preceding  discussion  of  the  reported  work  in 
multistage  classification  highlights  the  following  factors: 

(1) Most of the work on decision trees has dealt with the
task of converting decision tables to optimal trees. These
studies assume that the pattern classes are perfectly
separable, and hence unambiguously separated by the decision
table. Therefore, they seek to minimize the number of tree
nodes, the average path length, or some combination of these
two factors. In statistical pattern classification, however,
the classes do 'overlap', and the error rate of the tree is
an important consideration. The crucial part of the design
is that of designing the decision table, since this
determines the error rate. Moreover, for finite design
samples, one has the added complexity of selecting the
features which will be used in the classifier. The table
conversion algorithms do not address themselves to these two
problems, viz., feature selection and misclassification.


(2) There has been some work done on designing optimal
sequential recognition procedures for multiclass problems.
Sequential methods of recognition differ from hierarchical
classifiers both in the ordering imposed on the sequence of
feature measurements and the ordering on the set of possible
class labels. Sequential schemes impose a linear ordering on
the features, and in most cases there is no particular order
on the class labels, i.e. any class can be accepted at any
stage of the measurement process. In hierarchical methods,
the features as well as the class labels are ordered
hierarchically. Thus at each step of the measurement, some
classes are rejected from consideration as candidates for
the test sample's label. Dynamic programming has been used
to find the optimal decision policy and feature ordering for
sequential schemes. This work has not been extended further,
perhaps because of the exponential growth in computing
required for designing such schemes as the number of
features and classes increases.


(3) The methods of designing hierarchical classifiers
reported in the literature are heuristic methods making no
claims to optimality even under restrictive assumptions.
There has been little study done on the applicability of
optimisation methods such as dynamic programming and
branch-and-bound techniques for the design of such
classifiers.

This research was aimed at developing theoretical
approaches to the analysis and synthesis of a broad class of
multistage recognition schemes, including hierarchical
classifiers, and studying the use of optimisation methods
for their design.

Chapter 2 develops a general theoretical model, a
state-space model, to describe the behaviour of multistage
schemes. It is shown that most of the methods proposed in
the literature can be described in terms of this model. The
model's generality allows one to define new types of
classification schemes. In this model, a state consists of a
measurement set, and a set of possible classifications of
the sample. An edge in the graph represents the action of
observing a particular feature (or a set of features), and
has a cost, the measurement cost, associated with it. A goal
state is any state which contains one or zero class labels
as the set of possible classifications. Depending on the
manner of definition of decision making costs, two types of
admissible strategies are defined, viz., an S-admissible
strategy, and a B-admissible strategy. The goal cost is
assumed to be a weighted sum of measurement cost and
misclassification risk. Optimality and admissibility of the
strategies can be proved under certain conditions.

The heuristics used in such searches employ bounds on
the misclassification risk, to decrease the nodes to be
searched in the state-space graph. Methods of computing
bounds are derived for several parametric and nonparametric
classification schemes. This general model has the advantage
that nonparametric schemes, such as the nearest neighbour
rule, using Euclidean distance, or similarity measures for
nonmetric features, can also be modelled as a state space
search. New bounds on such distance measures are derived,
and it is shown that these bounds can lead to a substantial
reduction in the number of distance computations needed to
find the nearest neighbour.

In the above state-space model, a state consists of any
subset of features, and any subset of class labels. If all
states and all possible state transitions are considered,
the model rapidly becomes too large and complicated to
handle. One way to restrict the states, and state
transitions, is to make the state-space graph explicit, i.e.
to explicitly define the possible ordering of feature
measurements. Such a graph can still be searched using an
S-admissible or B-admissible strategy. A hierarchical
classifier is a particular type of graph in which a single
path is followed to classify a test sample, and where at
each stage in the decision making process, some classes are
rejected from consideration as possible labels of the
sample.

Chapter 3 derives a number of properties of hierarchical
classifiers, and examines in some detail those classifiers
whose node decisions are statistically independent. Bounds
are derived for the error in tree performance when such an
assumption is made. It is also shown that when the
independence assumption is valid, an upper bound on the
total tree performance can be derived in terms of the
performances of the classifiers used at the nodes of the
decision tree. The tree design problem is complex even under
this independence assumption. Two 'negative' properties bear
out this fact. First, it is proved that optimizing the
performance at each node does not necessarily optimize the
total tree performance. Secondly, choosing at each node
that feature which makes the node decision with the least
error does not necessarily constitute the best choice of
features to be used at the tree nodes.

The tree design problem can be simplified by using a
three phase approach proposed in chapter 4. It investigates
the use of dynamic programming in solving the following
three design tasks:

(1)  Design  of  the  optimal  policy  at  each  tree  node, 
given  the  tree  structure,  i.e.  the  tree  skeleton  and 
the  features  to  be  used  at  each  node. 

(2) Design of the optimal feature measurement and
decision  policy,  given  only  the  tree  skeleton,  i.e. 
the  classes  rejected  at  each  node  and  the  node 
hierarchy. 

(3) Design of the optimal tree structure, i.e. the
skeleton and the choice of features to be measured
at each node given that at each node, the decision
is a maximum likelihood rule using prior class
probabilities.

While the dynamic programming algorithms for solving (1)
and  (2)  are  optimal  for  any  feature  distribution,  the 
algorithm  presented  to  solve  problem  (3)  is  optimal  only 
under  certain  conditions. 

The dynamic programming methods rapidly become
cumbersome in storage and computational requirements as the
number of features and classes to be considered increases.
Chapter 5 studies methods of reducing the measurement and
decision complexity at the expense of degraded performance.
This chapter describes methods of reducing the complexity in
determining the optimal decision policy at each node, and
the optimal feature to be measured at each node. It proposes
techniques for clustering decision rules when the features
used at the tree nodes are discrete and non-metric. A
branch-and-bound algorithm for discarding rules which are
known to be suboptimal is also suggested. The problem of
assigning features to the nodes of a given tree skeleton is
also considered. A dynamic programming formulation, and a
branch-and-bound method for solving this problem are
described. The use of feature ranking at tree nodes is an
additional method of reducing the computations involved in
tree design.

Chapter 6 explores the relationship between sample size,
the number of classes, and dimensionality, as it affects
tree design. In particular, it is shown that for small
sample sizes, by breaking the M-class problem into a
hierarchy of two-class problems, the degradation in the
estimated performance from the true performance of the
classifier can be reduced. This result is illustrated for
the case of discrete features. An expression for the mean
accuracy of a decision tree for an M class problem is
derived. This analysis allows one to study the behaviour of
the classifier performance as a function of sample size,
quantization complexity and the number of classes. The
existence of an optimum quantization complexity for a given
sample size, discovered by earlier researchers [33,34] for a
two-class problem, is shown to hold for an M-class decision
tree recognition scheme.

In summary, a general model of multistage multiclass
recognition schemes, which trade accuracy for cost, has been
formulated and analyzed. Systematic methods for the design
of hierarchical classifiers have been investigated, as these
represent important special cases of multistage schemes.
This research has led to new results and new insights on
multistage classification, and proved the usefulness of
optimization methods in designing hierarchical classifiers.




2. General Model of Multistage Classification


This  chapter  presents  a formulation  of  the  multistage 
multiclass  pattern  recognition  problem  as  a state  space 
search. Two classes of strategies, called S-admissible and
B-admissible strategies, are presented as being optimal for
two types of pattern classification tasks.

An S-admissible strategy finds the minimum cost
goal (node) in the state-space graph when the goal cost
depends only on the features measured on the path to that
node in the graph. The search strategy uses a heuristic
function to decide the node to be traversed next, and
terminates when a goal node is reached.

A different type of strategy is needed when the goal
cost depends upon ALL features measured on the test sample.
The Bayes risk is an example of such a cost function. The
concept of a B-admissible strategy is defined for this kind
of a search problem. A B-admissible algorithm's execution
sequence can be regarded as composed of two parts, the first
of which uses a heuristic function to decide the node to be
traversed next, while the second part uses an upper bound on
the goal cost to reject certain classificatory decisions.
The algorithm terminates when all classes are rejected
except one, this being the B-optimal goal. In some
situations, it could terminate with the reject class, i.e.
none of the classes is accepted. A theorem on the
optimality of B-admissible algorithms is presented.

Viewing  the  multistage  classification  problem  as  a state 
space  search  allows  one  to  describe  a large  class  of 

parametric and non-parametric methods in a single framework.
The search efficiency of the S-admissible and B-admissible
strategies depends on the 'tightness' of the bounds on goal
costs used by them to evaluate alternative search paths.
Techniques for computing upper and lower bounds on a goal
cost, for use in admissible search strategies, are derived
for the following special cases:

. Statistically independent discrete features.

. Tree dependent discrete features.

. Nearest neighbour classification using Euclidean
distance (nonparametric).

. Classification using similarity measure between
binary vectors (nonparametric).


2.1.  State-Space  Model 

A multistage  multiclass  decision  making  process  can  be 
modelled  as  a search  for  a minimum  cost  goal  node  in  a 
state-space graph, G, which is a 7-tuple,

$G = (S, E, F, W, c, r, T)$

where,

S : is the set of states (nodes) in the graph,

E : is the set of possible transitions (edges)
    between states in S,

F : is the total set of features,

W : is the total set of class labels, $W = \{1, 2, \ldots, M\}$,
    where M is finite,

c : is a non-negative real-valued cost function on
    the edges in E, and represents the measurement
    cost,

r : is a real-valued function on the 'goal' nodes
    in S and represents the misclassification loss
    incurred by making the classificatory decision
    associated with that node, and,

T : is the decision strategy used for deciding the
    node to be traversed next; it could be a
    function of all measurements taken on the test
    sample.

A state (node) $s \in S$ is a tuple, $\langle F_s, W_s \rangle$, where,

$F_s \subseteq F$ is a subset of the features that are measured
    when that node is traversed, and,

$W_s \subseteq W$ is a subset of the class labels, and denotes
    the possible classifications that can be made
    on any path in the graph passing through s.

A goal node, s, is one for which,

$F_s = \emptyset$ and $|W_s| \le 1$, i.e. only one classificatory
decision is possible, or a 'reject' decision.

The  observed  values  of  the  feature  set  Fs  are  random 
variables.  The  decision  making  process  consists  of  starting 
with the initial node of G, taking the feature measurements
associated with that node, and using the strategy, T, to
decide which of the successors of the node to traverse next.
This process is repeated for any node selected for
'expansion'. If N(s*) denotes the set of states (nodes) on
the path to a goal node, s*, and c(N(s*)) is the sum of arc
costs along that path, and r(s*), the goal risk, then the
total cost of making the decision s* is defined by,

$f(s^*) = c(N(s^*)) + r(s^*)$    (2.1)
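The definitions of a state, a goal node and the total cost in (2.1) can be sketched with a small data structure. The class name, field names and the tiny three-node path below are assumptions introduced for illustration; the arc costs and the goal risk are arbitrary numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    features: frozenset   # Fs: features measured when the node is traversed
    labels: frozenset     # Ws: classifications still possible through this node

    def is_goal(self):
        # Goal: nothing left to measure and at most one label
        # (an empty label set stands for the 'reject' decision).
        return not self.features and len(self.labels) <= 1


def total_cost(path, arc_cost, risk):
    """Cost in the form of (2.1): arc costs summed along the path, plus the goal risk."""
    measurement = sum(arc_cost[(a, b)] for a, b in zip(path, path[1:]))
    return measurement + risk[path[-1]]


root = State(frozenset({"x1"}), frozenset({1, 2}))
mid = State(frozenset({"x2"}), frozenset({1, 2}))
goal = State(frozenset(), frozenset({1}))

arc_cost = {(root, mid): 1.0, (mid, goal): 2.0}   # measurement costs c
risk = {goal: 0.25}                               # goal risk r

print(goal.is_goal(), total_cost([root, mid, goal], arc_cost, risk))
```

Frozen dataclasses with frozenset fields are hashable, which lets states serve directly as dictionary keys for the cost and risk functions.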

Based on the form of the risk, r(s*), one can define two
broad categories of multistage classification schemes:

. the risk r is of the form, r(s*/Xs), where Xs is the
set of measurements on the path to s*.

. the risk r is a function of ALL measurements, not
just those on the path to s*. If k denotes a stage
variable and Xk the features observed till stage
k, then r is of the form r(s*/Xk) and it could vary
as more features are observed, possibly on other
paths that don't lead to s*.
The  following  sections  describe  admissible  strategies 
for  each  of  the  above  types  of  classification  schemes. 

2.2. S-admissible Strategies

An  S-admissible  strategy  is  defined  as  one  which,  among 
all possible goal states in G, terminates with the goal
state s*, for which the value of f(s*) is minimum, where,

$f(s^*) = c(N(s^*)) + r(s^*/X_s)$    (2.2)

and Xs are the measurements on the path to s* in G.

A search strategy is defined in terms of an evaluation
function which it uses to order the set of nodes currently
available for 'expansion' (i.e. the set of 'open' nodes).
This evaluation function, f(n), for node n, is defined as,

$f(n) = g(n) + h(n) + l(n)$

where g(n) is the cost from the initial node to node n,
h(n) is an estimate of the arc cost from n to a goal node
accessible from n, and l(n) is an estimate of the risk of a
goal node accessible from n. Algorithm S, described below,
uses the function f to select the order of expansion of the
nodes in G. This algorithm is similar to the ordered search
method proposed in [12] but differs from it in that we also
use a lower bound on the decision risk in the evaluation
function.

2.2.1.  Algorithm  S 

Let OPEN refer to a set of OPEN nodes, i.e. candidates
for expansion. Let CLOSED refer to the set of nodes already
traversed.

Step 0 : Set OPEN to contain the initial state, $\langle X, W \rangle$.
         Set CLOSED to the null set.

Step 1 : For each node n in OPEN compute f(n), as,

         $f(n) = g(n) + h(n) + l(n)$

         where,

         $g(n) = c(N(n))$

         $h(n) = 0$ if n is a goal node,
         $h(n) \le \min_{j \in W_n} [ c(N(j)) - c(N(n)) ]$ if n is not a goal.

         $l(n) = r(n/X_n)$ if n is a goal node, Xn being the
         features observed on the path to n,
         $l(n) \le \min_{j \in W_n} [ \min_{Y} r(j/X_n, Y) ]$ when n is not a goal,
         j is a goal below node n, Y the set of
         features not yet observed and Wn the set
         of goal states accessible from n.

Step 2 : Let n* be the node having minimum f(n).
         Remove n* from OPEN and put it in CLOSED. If n*
         was a goal node, STOP with n* as the optimal goal,
         else take the measurement x associated with n*
         and put the successor nodes of n* in OPEN,
         then repeat step 1.
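The steps of Algorithm S can be sketched as an ordered search over a priority queue. This is a simplified illustration under stated assumptions: the graph is a small explicit tree, the bounds h and l are supplied by the caller and do not depend on the measured values, and all successors of a closed node are opened. The function name, the example graph and its costs and risks are hypothetical.

```python
import heapq
import itertools


def algorithm_s(start, successors, is_goal, h, l):
    """Ordered search on f(n) = g(n) + h(n) + l(n).

    successors(n) yields (child, arc cost); h(n) and l(n) are lower-bound
    estimates of the remaining arc cost and of the goal risk, with l(n)
    equal to the risk itself when n is a goal.
    Returns (goal, f(goal), nodes in the order they were closed).
    """
    tie = itertools.count()                       # keeps heap entries comparable
    open_heap = [(h(start) + l(start), next(tie), 0.0, start)]   # Step 0
    closed = []
    while open_heap:
        f, _, g, n = heapq.heappop(open_heap)     # Step 2: minimum f(n)
        closed.append(n)
        if is_goal(n):
            return n, f, closed
        for child, cost in successors(n):         # open the successor nodes
            g_child = g + cost
            f_child = g_child + h(child) + l(child)
            heapq.heappush(open_heap, (f_child, next(tie), g_child, child))
    return None, float("inf"), closed


# Hypothetical explicit graph: arc costs are measurement costs.
graph = {"root": [("a", 1.0), ("b", 2.0)], "a": [("ga", 3.0)],
         "b": [("gb", 0.5)], "ga": [], "gb": []}
risk = {"ga": 0.2, "gb": 0.1}                     # risks of the two goal nodes


def successors(n):
    return graph[n]


def is_goal(n):
    return n in risk


def h(n):
    return 0.0                                    # trivial lower bound on arc cost


def l(n):
    return risk.get(n, 0.0)                       # exact at goals, 0 elsewhere


print(algorithm_s("root", successors, is_goal, h, l))
```

With the trivial bounds used here the search closes every interior node before reaching the cheaper goal; tighter values of h and l would let it skip the branch through "a".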

Theorem 1: Algorithm S is S-admissible.

Proof: The optimality of a goal node n* put in CLOSED by
algorithm S follows from the fact that the evaluation
function, f(n), underestimates the cost of ANY goal node
below n. Since the value of f(n*) is less than f(n) for any
OPEN node, it follows that n* has a smaller value of f() as
given by equation (2.1) than any other goal node. Hence, it
is optimal.

Theorem 2 shows that the lower bound on the risk of a
goal is a non-decreasing function on the sequence of
observations taken along the path to that goal. This fact is
used along with a consistency assumption on the h function,
to prove in Theorem 3 that the evaluation function, f(n), is
non-decreasing on the sequence of nodes expanded by
algorithm S.

Theorem 2: Let Xn be the measurements on the path to a
goal node, n. Let X1, X2 be subsets of Xn, such that $X_1 \subseteq X_2$.
Let $\bar{X}$ denote the complement of X with respect to Xn. Then,

$\min_{\bar{X}_2} r(n/X_2, \bar{X}_2) \ge \min_{\bar{X}_1} r(n/X_1, \bar{X}_1)$

Proof: Let $X_3 = X_2 \cap \bar{X}_1$, i.e. the set of features in X2
and not in X1. Then,

$\min_{\bar{X}_2} r(n/X_2, \bar{X}_2) = \min_{\bar{X}_2} [ r(n/X_1, X_3, \bar{X}_2) ]$

$\min_{\bar{X}_1} r(n/X_1, \bar{X}_1) = \min_{X_3} [ \min_{\bar{X}_2} r(n/X_1, X_3, \bar{X}_2) ]$

Since,

$\min_{X_3} r(n/X_1, X_3, \bar{X}_2) \le r(n/X_1, X_3, \bar{X}_2)$

for any particular value of X3 on the right hand side of the
above inequality, it follows that,

$\min_{\bar{X}_1} r(n/X_1, \bar{X}_1) \le \min_{\bar{X}_2} r(n/X_2, \bar{X}_2)$

Definition [12]: The heuristic function, h(n), which
estimates the arc cost from n to a goal node, is said to be
consistent, iff for any two nodes, i, j, such that there is a
path from i to j of cost, c(i,j),

$h(i) - h(j) \le c(i,j)$.
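The consistency condition can be checked mechanically on an explicit graph. The sketch below tests it arc by arc, which is sufficient for every path because path costs add and the differences h(i) - h(j) telescope along the path. The function name and the example values are assumptions for illustration.

```python
def is_consistent(h, arcs):
    """h maps each node to its heuristic value; arcs maps (i, j) to c(i, j).

    Returns True when h(i) - h(j) <= c(i, j) holds on every arc.
    """
    return all(h[i] - h[j] <= cost for (i, j), cost in arcs.items())


arcs = {("root", "a"): 1.0, ("a", "g"): 3.0}

h_good = {"root": 4.0, "a": 3.0, "g": 0.0}   # drops by at most the arc cost
h_bad = {"root": 5.0, "a": 3.0, "g": 0.0}    # drops by 2 across an arc of cost 1

print(is_consistent(h_good, arcs), is_consistent(h_bad, arcs))
```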

Theorem 3: If the function h used by Algorithm S is
consistent, then the evaluation function, f(n), is
non-decreasing on the sequence of nodes expanded by S.

Proof: Consider two nodes, i, j, such that S closed i
before j. There are two cases that could arise, either j is
a descendant of i or it is not. If it is a descendant,

$g(i) + h(i) \le g(j) + h(j)$

by the consistency assumption on h, since,

$h(i) - h(j) \le g(j) - g(i) = c(i,j)$.

Also by Theorem 2, $l(i) \le l(j)$, hence, $f(i) \le f(j)$ and
the theorem is proved. If j was not a descendant of i, it
means that when i was closed by S, there was an ancestor of
j, say k, which was in the OPEN list and was not closed.
Hence, $f(i) \le f(k)$ since i was chosen. However, since k is
an ancestor of j, by the earlier part of this proof,
$f(k) \le f(j)$.

Hence, it follows that,

$f(i) \le f(j)$ and the theorem is proved.

The  above  theorem  will  be  used  to  prove  the  optimality 
of  algorithm  S.  The  optimality  property  states  that 
algorithm  S will  not  expand  any  more  nodes  than  any  other 
S-admissible  strategy  which  is  'less  informed'  than  it. 

Theorem 4: Consider two S-admissible strategies, which
use functions, h, l and h', l', respectively, where h and h'
are consistent, and such that,

$l(n) = l'(n)$ for all goal nodes, n, and,

$h(n) + l(n) > h'(n) + l'(n)$

for all other nodes in G.

Then the strategy using h', l', expands all the nodes
expanded by the strategy using h and l.

Proof: Assume the contrary is true, i.e. assume there is
a node i expanded by the strategy using {h,l}, which was not
expanded by that using {h',l'}. Both algorithms terminated
with the same goal node, say n*. By Theorem 3,

$g(n^*) + l(n^*) \ge g(i) + h(i) + l(i)$.

Since $l(n^*) = l'(n^*)$, n* being a goal node, and since,

$h(i) + l(i) > h'(i) + l'(i)$

it follows that,

$g(n^*) + l'(n^*) > g(i) + h'(i) + l'(i) = f'(i)$.

However, by the assumption that strategy {h',l'} didn't
expand i, it implies that,

$f'(i) \ge f'(n^*) = g(n^*) + l'(n^*)$,

which is the converse of the previous inequality. Thus,
there could be no node, i, expanded by {h,l} and not by
{h',l'}.


2.2.2.  k-Step  Lookahead  Heuristic 

In  a particular  classification  scheme,  the  state-space 
graph,  G,  may  be  known  explicitly.  i.e.  the  possible 
orderings  of  feature  measure*  ment s and  decision  making 
hierarchy  are  given.  The  graph  is  also  finite  since  the 
feature  set  and  the  class  label  set  are  both  finite.  One  can 
then  define  a 'k-step  lookahead'  heuristic,  Dk(n),  for  an 
open  node  n,  as  a lower  bound  on  the  additional  cost 
incurred  in  going  up  to  k levels  beyond  n in  the  graph. 


Let

Xn be the measurements taken on the path to n,
l(n/Xn) be a lower bound on the risk of any goal n* accessible from n, given that Xn has been observed,
c(i,j) be the cost of the arc joining nodes i and j,
Γ(i) be the set of successor nodes of i, and
Gk(n) be the subgraph whose node set consists of n and all unopened nodes k or fewer arc lengths away from n, and whose edge set comprises all arcs in G joining nodes in this node set.

Then the heuristic Dk(n) may be computed using the following recursive procedure [Ref. 2, pp. 229]:

\[ D_k(i) = \min_{j \in \Gamma(i)} \left[ c(i,j) + D_k(j) \right] \]

where, for a terminal node, j, in Gk(n),

\[ D_k(j) = \begin{cases} 0 & \text{if } j \text{ is not a goal node,} \\ l(j/X_n) & \text{if } j \text{ is a goal node.} \end{cases} \]
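The recursion above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the callbacks `successors`, `arc_cost`, `is_goal` and `goal_bound` and the toy graph are hypothetical names and data, not part of the original text, and the restriction to unopened nodes is ignored for brevity.

```python
def lookahead(node, k, successors, arc_cost, is_goal, goal_bound):
    """Return D_k(node), a lower bound on the extra cost up to k levels below node.

    successors(n)  -> list of successor nodes of n (hypothetical callback)
    arc_cost(i, j) -> cost c(i, j) of the arc from i to j
    is_goal(n)     -> True if n is a goal node
    goal_bound(n)  -> lower bound l(n/Xn) on the risk of goal n
    """
    succ = successors(node)
    if k == 0 or not succ:
        # Terminal node of the k-level subgraph: l(.) for a goal, 0 otherwise.
        return goal_bound(node) if is_goal(node) else 0.0
    return min(
        arc_cost(node, j)
        + lookahead(j, k - 1, successors, arc_cost, is_goal, goal_bound)
        for j in succ
    )


# Toy graph, made up for illustration: n -> a -> g1 and n -> b -> g2.
GRAPH = {"n": ["a", "b"], "a": ["g1"], "b": ["g2"], "g1": [], "g2": []}
COST = {("n", "a"): 1.0, ("n", "b"): 2.0, ("a", "g1"): 5.0, ("b", "g2"): 1.0}
RISK_BOUND = {"g1": 0.5, "g2": 0.25}


def d(k):
    """D_k for the start node "n" of the toy graph."""
    return lookahead(
        "n", k,
        GRAPH.__getitem__,
        lambda i, j: COST[i, j],
        RISK_BOUND.__contains__,
        RISK_BOUND.__getitem__,
    )
```

On this toy graph a one-step lookahead only sees the cheap first arc towards `a`, while a two-step lookahead discovers that the path through `b` is cheaper overall, so the deeper bound is the larger one.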

Lemma 1: If m > k, then Dm(n) ≥ Dk(n) for all nodes, n.

The proof follows from the additive nature of arc costs.

Corollary 1: If two S-admissible strategies use the evaluation functions

f(n) = g(n) + Dm(n) and f(n) = g(n) + Dk(n), respectively,

and m > k, then the strategy using Dk will expand all the nodes expanded by that using Dm.

Proof: The corollary follows from Lemma 1 and Theorem 4.


2.3. B-admissible Strategies

An S-admissible strategy terminates when it first puts a goal node on the CLOSED list. It is guaranteed that this goal is optimal, because of the assumption that no measurements taken on other paths, not leading to this goal, can change the goal risk, r. However, if r were to change with additional measurements taken on other paths, then the optimality of the first goal node put into CLOSED cannot be guaranteed, because these observations may increase that goal's risk while decreasing the risk of some other goal. The question arises: if r is a function of all measurements (denoted by X), is there a search strategy which finds the goal node n* for which f(n*) is minimum, without taking all measurements, where

\[ f(n^*) = c(N(n^*)) + r(n^*/X). \qquad (2.3) \]

This section describes certain cases when such a strategy, called a B-admissible strategy, can be formulated. A B-admissible strategy uses an upper bounding function, u, defined in the following way.

Definition: Let Xk be the observations taken up to the k-th stage of a search, and let n be a goal node in G. Then an upper bounding function, u, is defined as

\[ u(n/X_k) \ge \max_{\bar{X}_k} r(n/X_k, \bar{X}_k) \]

where $\bar{X}_k$ is the complement of the observations Xk, with respect to the total set of available observations, X.

Algorithm B, described below, uses an upper bounding function u, and a lower bounding function analogous to that used for S-admissible strategies, to obtain the B-optimal solution, n*. This algorithm is similar to Algorithm S except that it does not terminate when a goal node is put in the CLOSED list. Rather, at each iteration after a goal node exists in CLOSED, it checks if there is some goal node in it such that the upper bound on its cost (for any possible future measurement sequence) is less than the lower bound on the cost of any other goal node, either in CLOSED or below some open node in the graph. When this condition is satisfied, it terminates with that goal node as the decision.

2.3.1. Algorithm B

Let OPEN be the set of open nodes. Let CLOSED be the set of goal nodes that have been closed.

Step 0: Set OPEN to the initial state, and CLOSED to null. Set k, the stage variable, to 0.

Step 1: Set k = k+1. For each n ∈ OPEN compute

\[ f(n) = g(n) + h(n) + l(n), \]

where

\[ g(n) = c(N(n)), \]

\[ h(n) = \begin{cases} 0 & \text{if } n \text{ is a goal node,} \\ \min_{j \in W_n} \left[ c(N(j)) - c(N(n)) \right] & \text{if } n \text{ is not a goal,} \end{cases} \]

\[ l(n) = \begin{cases} \min_{\bar{X}_k} r(n/X_k, \bar{X}_k) & \text{if } n \text{ is a goal node,} \\ \min_{j \in W_n} \left[ \min_{\bar{X}_k} r(j/X_k, \bar{X}_k) \right] & \text{if } n \text{ is not a goal.} \end{cases} \]

Step 2: Let n* be the node with a minimum value of f(n). If n* is a goal node, remove it from OPEN and put it in CLOSED. If n* is not a goal node, take the measurement associated with that state, and replace n* in OPEN by its successors.

Step 3: For each goal node n in the CLOSED list, compute

\[ b(n) = c(N(n)) + u(n/X_k), \]

where u() is an upper bound on the risk r, i.e.

\[ u(n/X_k) \ge \max_{\bar{X}_k} r(n/X_k, \bar{X}_k). \]

Let n* be the goal node with a minimum value of b(n). If for all n in OPEN ∪ CLOSED, n ≠ n*,

\[ f(n) > b(n^*), \]

STOP, with n* as the B-optimal solution, else go to Step 1.

Note that if in Step 3 the inequality f(n) > b(n*) is satisfied, the inequality f(n) > f(n*) is also satisfied, because b(n*) ≥ f(n*).
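The termination test of Step 3 can be sketched as a small helper. This is an illustrative sketch, not the author's implementation: the function name `b_stop` and the dictionary-based bookkeeping are assumptions, and the values of f and b are taken as already computed by Steps 1 and 3.

```python
def b_stop(f_open, f_closed, b_closed):
    """Return the B-optimal goal if the Step 3 criterion holds, else None.

    f_open   : dict node -> f(n) for nodes in OPEN
    f_closed : dict goal -> f(n) for goal nodes in CLOSED
    b_closed : dict goal -> b(n) = c(N(n)) + u(n/Xk) for goal nodes in CLOSED
    (All three containers are hypothetical bookkeeping for this sketch.)
    """
    if not b_closed:
        return None  # no goal has been closed yet, so nothing can be accepted
    n_star = min(b_closed, key=b_closed.get)
    others = {**f_open, **f_closed}
    # Accept n* only if every other node's lower bound exceeds b(n*).
    if all(f > b_closed[n_star] for n, f in others.items() if n != n_star):
        return n_star
    return None


# Made-up values: goal g1 has the smallest upper bound b = 3.0.
accepted = b_stop({"p": 5.0}, {"g1": 2.0, "g2": 4.5}, {"g1": 3.0, "g2": 6.0})
# Here the open node p could still hide a goal cheaper than b(g1), so no stop.
pending = b_stop({"p": 2.5}, {"g1": 2.0, "g2": 4.5}, {"g1": 3.0, "g2": 6.0})
```

Keeping the test separate from the expansion loop makes it easy to see that the ordering of expansions depends only on f, while u influences only when the search stops.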


Theorem 5: Algorithm B is B-admissible, i.e. it terminates with the solution, s*, such that

\[ c(N(s^*)) + r(s^*/X) \le c(N(s)) + r(s/X) \]

for all goals s in G, where X is the total set of features that can be used.

Proof: Algorithm B will terminate when it finds, at stage k, a node n* in CLOSED such that, for all other nodes n in OPEN or CLOSED,

\[ b(n^*) < c(N(n)) + h(n) + l(n). \]

Suppose n is any node (≠ n*) in CLOSED. Then h(n) = 0 and l(n) ≤ r(n/X), from the definitions of h, l. Hence,

\[ c(N(n^*)) + r(n^*/X) \le c(N(n^*)) + u(n^*/X_k) = b(n^*) < c(N(n)) + h(n) + l(n) \le c(N(n)) + r(n/X), \]

since h(n) = 0 for any goal node, n. Hence, n* is a better goal than any n in CLOSED. Suppose n was a goal in Wj for some node j in OPEN. Then,

\[ b(n^*) < f(j) = c(N(j)) + h(j) + l(j) \le c(N(n)) + r(n/X), \]

because, by the definition of h, l, h(j) ≤ c(j,n) for any goal n in Wj, where c(j,n) denotes the cost of the path between j and goal n. Also,

\[ l(j) \le \min_{\bar{X}_k} r(n/X_k, \bar{X}_k) \le r(n/X), \]

by the definition of l in Step 1 of Algorithm B.

Hence,

\[ c(N(n^*)) + r(n^*/X) \le b(n^*) < c(N(n)) + r(n/X). \]

Thus, n* is a better goal than any goal n under any open node j. Since all the goal nodes are either in CLOSED or under some OPEN node, and none is better than n* across all measurements, X, n* is B-optimal, and the algorithm is B-admissible.

In Theorem 4, a method of comparing two S-admissible algorithms was presented. The next theorem allows one to compare two B-admissible algorithms using different bounding functions, u. It states that if two B-admissible strategies use the same evaluation function to order the open nodes, but one uses a more 'informed' upper bounding function than the other, it can never expand more nodes than the less informed strategy.

Theorem 6: Consider two B-admissible strategies using the same functions, h, l, but different upper bounding functions, u and u', such that, for all goal nodes, n, and measurement sets, Xk,

\[ u(n/X_k) \le u'(n/X_k). \]

Then the strategy using u' expands all the nodes expanded by the strategy using u.

Proof: Since both the strategies use the same functions, h and l, they use the same evaluation function, f, to decide the node to be closed. Hence, given that both strategies have gone through k nodes, it follows that both expand the same nodes in the same sequence. Thus, the only question is to decide which strategy terminates first, i.e. for which strategy the condition

\[ b(n^*) < f(n) \]

is satisfied earlier. Since

\[ c(N(n^*)) + u(n^*/X_k) \le c(N(n^*)) + u'(n^*/X_k) \]

for all Xk, it is clear that the termination criterion in Step 3 of the algorithm will be satisfied no later for the strategy using u than for that using u'. Hence u' will expand all nodes expanded by u.

Unfortunately, one cannot compare two strategies using different heuristics, h+l and, say, h'+l', since in this case the two strategies would expand different nodes, and it is possible for the less informed strategy, i.e. the one with the smaller value of h'+l' at each node, to terminate earlier, because the measurement it takes may reduce the bound b(n*) sharply, thus causing other classes to get rejected earlier than for the more informed strategy.

2.3.2. Bayes Optimal Strategy

A Bayes optimal classification scheme is one which minimizes the average risk. Let X be the total set of measurements on a sample; then if there are M classes, w1, w2, ..., wM, the average risk of classifying X in wi is

\[ r(w_i/X) = \sum_{j=1}^{M} c_{ij}\, p(w_j/X) \]

where cij is the loss incurred in classifying a sample from wj into wi, and p(wj/X) is the likelihood of wj given X. Hence, a Bayes optimal strategy would classify X into class w* such that

\[ r(w^*/X) = \min_{i} r(w_i/X). \]

From the above equation, it follows that such a strategy can be implemented as a special case of a B-admissible strategy, wherein the arc costs do not influence the cost of a goal s*. The only change required in Algorithm B is in Step 3, which becomes:

Step 3: For each goal node n in CLOSED compute the upper bound u(n/Xk) on the risk r(n/X), given that the measurements Xk have been observed so far. Let n* be the node with minimum value of u(). Then, if for all nodes n in OPEN ∪ CLOSED, other than n*,

u(n*/Xk) < l(n/Xk) is true,

STOP, with n* as the Bayes optimal classification, else go to Step 1.
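The average-risk rule at the start of this subsection can be illustrated with a short Python sketch. The function name, the 0-1 loss matrix and the posterior values are all made up for illustration; only the formula for r(wi/X) comes from the text.

```python
def bayes_decision(loss, posterior):
    """Return (index of the minimum-risk class, list of average risks).

    loss[i][j]   : loss c_ij of deciding class i when the true class is j
    posterior[j] : p(w_j / X) for the full measurement set X
    """
    risks = [sum(c * p for c, p in zip(row, posterior)) for row in loss]
    best = min(range(len(risks)), key=lambda i: risks[i])
    return best, risks


# Hypothetical three-class problem with unit loss for every error.
LOSS = [[0, 1, 1],
        [1, 0, 1],
        [1, 1, 0]]
POSTERIOR = [0.2, 0.5, 0.3]
best, risks = bayes_decision(LOSS, POSTERIOR)
```

With a 0-1 loss the risk of deciding class i is simply one minus its posterior, so the minimum-risk class is the one with the largest posterior (index 1 in this made-up case).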


2.3.3. Graphical Representation of B-admissible Search

The role played by the evaluation heuristic, h+l, and the upper bounding heuristic, u, can be depicted graphically in 2-space. In this space, each node, open or closed, is represented by its values of f(n) and b(n). Though Algorithm B computes b(n) only for goal states n in CLOSED, one could define it for non-goal nodes as

\[ b(n) = \min_{j \in W_n} \left[ c(N(j)) + \max_{\bar{X}_k} r(j/X_k, \bar{X}_k) \right] \]

where j is any goal node below node n, i.e. j ∈ Wn.

Figure 2.1 shows a display of all open and closed nodes in the b-f plane. Since b(n) ≥ f(n), all points lie above the 45 degree line, 01. Node a has the minimum value of f and will be closed next, while node b in CLOSED has the smallest value of b(n), and hence all nodes with

\[ f(n) > \min_{n \in \text{CLOSED}} b(n) \]

will be rejected from further consideration. The reject region is marked in the figure. Thus, the search is reduced to the domain, R, and the nodes in this region contain the potential B-optimal node, or an ancestor of the B-optimal node. Figure 2.2 shows the situation when the algorithm terminates with the solution, n*. All other nodes now fall in the reject region. The movement of the nodes in the b-f space as more measurements are taken is shown by small arrows in Fig. 2.1. That they should move in such a direction follows from the fact that f must increase and b decrease as more measurements are taken.

Fig. 2.1


2.4. Methods Of Computing Bounding Functions

Having discussed the properties of S- and B-admissible search strategies, we consider here methods of computing lower and upper bounding functions, l, u, for some special cases. Two broad types of risk functions, r(n), are considered:

. r is a function of class conditional probability distributions or parameters of distributions (parametric).

. r is defined in terms of a set of labelled samples and certain similarity or distance measures between samples (nonparametric methods).

The two types of parametric methods considered are:

. the measurements are class conditionally statistically independent, and,

. the measurements satisfy a first order tree dependence.

The two types of nonparametric methods described are:

. nearest neighbour classification using Euclidean distance,

. classification using a similarity measure between binary vectors.


2.4.1. Statistically Independent Features

Assume that the features are class-conditionally statistically independent, i.e. the probability p(X/wi) can be written as

\[ p(X/w_i) = \prod_{t=1}^{N} p(x_t/w_i) \quad \text{where } X = (x_1, x_2, \ldots, x_N). \]

The average risk, r(wi/X), is given by

\[ r(w_i/X) = \sum_{j} c_{ij}\, P(w_j/X) \]

where cij is the cost of making decision wi when the correct decision is wj. Assuming a unit loss for incorrect decisions and no gain for correct decisions,

\[ r(w_i/X) = 1 - \frac{P(w_i) \prod_{t=1}^{N} p(x_t/w_i)}{\sum_{j} P(w_j) \prod_{t=1}^{N} p(x_t/w_j)} \qquad (2.4) \]

where P(wj) is the a priori probability of class j.

If X1 ⊂ X is the set of features already measured, then let l(wi/X1), u(wi/X1) be the lower and upper bounds, respectively, on r(wi/X), i.e.

\[ l(w_i/X_1) \le \min_{\bar{X}_1} r(w_i/X_1, \bar{X}_1) \]

\[ u(w_i/X_1) \ge \max_{\bar{X}_1} r(w_i/X_1, \bar{X}_1) \]

where $\bar{X}_1$ is the complement of X1 with respect to the total set of measurements, X, upon which r depends.

Define,

\[ l_t(i) = \min_{x_t} p(x_t/w_i) \]

\[ u_t(i) = \max_{x_t} p(x_t/w_i) \]

Then, from Eqn. (2.4) and the above definitions of lt, ut, we can define the bounds as follows:


\[ l(w_i/X_1) = 1 - \frac{P(w_i)\, p(X_1/w_i) \prod_{t} u_t(i)}{P(w_i)\, p(X_1/w_i) \prod_{t} u_t(i) + \sum_{j \ne i} P(w_j)\, p(X_1/w_j) \prod_{t} l_t(j)} \qquad (2.5a) \]

\[ u(w_i/X_1) = 1 - \frac{P(w_i)\, p(X_1/w_i) \prod_{t} l_t(i)}{P(w_i)\, p(X_1/w_i) \prod_{t} l_t(i) + \sum_{j \ne i} P(w_j)\, p(X_1/w_j) \prod_{t} u_t(j)} \qquad (2.5b) \]

In equations (2.5a)-(2.5b) above, the product terms in lt, ut are taken only for those features, xt, that are not in the observations already taken, i.e. xt ∉ X1.

The above bounds are too pessimistic because they assume that the values of $\bar{X}_1$ that minimize/maximize p(X/wi) simultaneously maximize/minimize p(X/wj) for all j ≠ i. One could obtain tighter bounds by computing P(wi/X1, $\bar{X}_1$) for all $\bar{X}_1$, but for the case of N features, M classes, and each feature taking on m states, this would require M·m^N calculations, which is a large number even for moderately small values of M, m, and N. However, there is a special case when the tight bounds can be computed with no more computations than are required in obtaining l, u using (2.5a-b) above. This is the case where the unconditional distribution of X can be written as a product, i.e.

\[ p(X) = \prod_{t=1}^{N} p(x_t). \]

Then one can define

\[ l'_t(i) = \min_{x_t} \frac{p(x_t/w_i)}{p(x_t)} \]

\[ u'_t(i) = \max_{x_t} \frac{p(x_t/w_i)}{p(x_t)} \]

Writing p(X) for the denominator of (2.4), and using the above definitions, one can define tight lower and upper bounds on the risk of deciding class wi, given observations X1:

\[ l'(w_i/X_1) = 1 - P(w_i/X_1) \prod_{t} u'_t(i) \qquad (2.6a) \]

\[ u'(w_i/X_1) = 1 - P(w_i/X_1) \prod_{t} l'_t(i) \qquad (2.6b) \]

Using the bounds given by (2.5a-b), or (2.6a-b), requires computing the 2NM quantities, lt, ut, or lt', ut' respectively.
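The reason the tight bounds cost no more than the loose ones can be written out as a short derivation. It is a sketch under the two product assumptions already stated (class-conditional independence and a product form for p(X)); the index notation $x_t \in X_1$ and $x_t \notin X_1$ is shorthand introduced here for observed and unobserved features.

```latex
% Assumptions: p(X/w_i) = \prod_t p(x_t/w_i) and p(X) = \prod_t p(x_t).
\begin{align*}
P(w_i/X) &= \frac{P(w_i)\, p(X/w_i)}{p(X)}
          = P(w_i) \prod_{t=1}^{N} \frac{p(x_t/w_i)}{p(x_t)} \\
         &= \underbrace{P(w_i) \prod_{x_t \in X_1} \frac{p(x_t/w_i)}{p(x_t)}}_{P(w_i/X_1)}
            \;\prod_{x_t \notin X_1} \frac{p(x_t/w_i)}{p(x_t)} .
\end{align*}
```

Each remaining factor depends on a single unobserved feature, so the product over the unobserved features is maximized or minimized factor by factor. The extreme values of P(wi/X) over all completions of X1 are therefore obtained from the per-feature extremes of the ratio p(xt/wi)/p(xt), with no coupling between classes.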




Example:

Consider the following example in which 4 features, x1..x4, are available to classify the sample into one of 3 classes. Each feature can be in one of 4 states, a, b, c and d, and it is assumed that the features are class-conditionally statistically independent. The probabilities are tabulated, and the maximum and minimum values of p(xj/wi) are shown in the last two rows for each class, wi, and each feature, xj. These are the values uj(i), lj(i) respectively.

                 Class 1                Class 2                Class 3
           x1   x2   x3   x4      x1   x2   x3   x4      x1   x2   x3   x4
   a       .4   .3   .2   .2      .5   .4   .1   .1      .1   .4   .1   .2
   b       .4   .5   .4   .2      .2   .1   .1   .1      .3   .1   .3   .3
   c       .1   .1   .2   .4      .2   .3   .6   .1      .5   .2   .3   .2
   d       .1   .1   .2   .2      .1   .2   .2   .7      .1   .3   .3   .3
   ut      .4   .5   .4   .4      .5   .4   .6   .7      .5   .4   .3   .3
   lt      .1   .1   .2   .2      .1   .1   .1   .1      .1   .1   .1   .2

Assume that features x1 and x2 have been observed, and that their values are x1 = a and x2 = b. Using equations (2.5a) and (2.5b) one can compute the lower bound and upper bound on the risk for each decision, wi, denoted by l(wi/x1,x2) and u(wi/x1,x2). The a priori probabilities do not appear in the computation below, i.e. the classes are treated as equally likely and the priors cancel.

\[ u(w_1/x_1,x_2) = 1 - \frac{(.4)(.5)\, l_3(1)\, l_4(1)}{(.4)(.5)\, l_3(1)\, l_4(1) + (.5)(.1)\, u_3(2)\, u_4(2) + (.1)(.1)\, u_3(3)\, u_4(3)} \]

\[ = 1 - \frac{(.4)(.5)(.2)(.2)}{(.4)(.5)(.2)(.2) + (.5)(.1)(.6)(.7) + (.1)(.1)(.3)(.3)} = 1 - 80/299 = 0.732 \]

Similarly,

l(w1/x1=a, x2=b) = 1 - 320/327 = 0.021
u(w2/x1=a, x2=b) = 1 - 5/334 = 0.985
l(w2/x1=a, x2=b) = 1 - 210/292 = 0.281
u(w3/x1=a, x2=b) = 1 - 2/532 = 0.996
l(w3/x1=a, x2=b) = 1 - 9/94 = 0.904

From the above values it is clear that class w3 can be rejected from consideration, since after the first two measurements, the lower bound on the risk of making the decision 'class 3', viz. l(w3/X1), is greater than the upper bound on the risk of deciding in favour of class w1, viz. u(w1/X1).
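The bounds (2.5a) and (2.5b) for this example can be checked with a short script. It is a sketch under stated assumptions: the table is re-entered by hand (with the class 1, x4, state d entry taken as .2 so that the column sums to one), equal priors are assumed because none appear in the worked fractions, and the function and variable names are hypothetical.

```python
# P[c][t] lists p(x_t = a, b, c, d / w_c) for class c and feature index t (0-based).
P = {
    1: [[.4, .4, .1, .1], [.3, .5, .1, .1], [.2, .4, .2, .2], [.2, .2, .4, .2]],
    2: [[.5, .2, .2, .1], [.4, .1, .3, .2], [.1, .1, .6, .2], [.1, .1, .1, .7]],
    3: [[.1, .3, .5, .1], [.4, .1, .2, .3], [.1, .3, .3, .3], [.2, .3, .2, .3]],
}
PRIORS = {1: 1 / 3, 2: 1 / 3, 3: 1 / 3}  # assumption: equally likely classes


def risk_bounds(table, priors, observed):
    """Return {class: (lower, upper)} risk bounds in the manner of (2.5a), (2.5b).

    observed maps a feature index to the index of its observed state.
    """
    lo, hi = {}, {}
    for c, feats in table.items():
        seen, lo_rest, hi_rest = 1.0, 1.0, 1.0
        for t, probs in enumerate(feats):
            if t in observed:
                seen *= probs[observed[t]]      # factor of p(X1 / w_c)
            else:
                lo_rest *= min(probs)           # l_t(c)
                hi_rest *= max(probs)           # u_t(c)
        lo[c] = priors[c] * seen * lo_rest
        hi[c] = priors[c] * seen * hi_rest
    bounds = {}
    for i in table:
        others_lo = sum(lo[j] for j in table if j != i)
        others_hi = sum(hi[j] for j in table if j != i)
        lower = 1 - hi[i] / (hi[i] + others_lo)  # (2.5a)
        upper = 1 - lo[i] / (lo[i] + others_hi)  # (2.5b)
        bounds[i] = (lower, upper)
    return bounds


# x1 = a (state index 0) and x2 = b (state index 1) have been observed.
bounds = risk_bounds(P, PRIORS, {0: 0, 1: 1})
```

The resulting dictionary reproduces the fractions quoted in the example, and comparing `bounds[3][0]` with `bounds[1][1]` gives the rejection of class w3 described in the text.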


2.4.2. Tree Dependent Features

In many applications the assumption of statistical independence is unjustifiable, and one seeks a more elaborate model of feature dependence. One such model is that of first order tree dependence [23], wherein the class conditional distribution of X is

\[ p(X/w_i) = p(x_1/w_i) \prod_{j=2}^{N} p(x_j/x_{t(j)}, w_i) \quad \text{where } t(j) < j, \]

i.e. the features can be ordered so that the probability of X can be written as a product of first order conditional probabilities. The dependence between the features can be represented by a dependence tree, Td, such that there are nodes labelled x1, x2, ..., xN, i.e. the feature names, and there is a directed edge from node t(j) to j for j = 2, ..., N. For the tree of Fig. 2.3, for example, p(X/wi) can be written as

\[ p(X/w_i) = p(x_1/w_i)\, p(x_2/x_1,w_i)\, p(x_3/x_1,w_i)\, p(x_4/x_2,w_i)\, p(x_5/x_2,w_i). \]

Fig. 2.3

It is assumed in the following discussion that the features, xt, t = 1, 2, ..., N, are discrete, xt taking on one of mt states. We shall define a measurement graph, a spanning tree and a real valued function on a tree, and then describe a method of computing lower/upper bounds on the risk r(wi/X) given that a subset of features, X1 ⊂ X, has been observed.


Definition: A measurement graph, G(Td,wi), for class wi and dependence tree, Td, is defined thus: If x1, ..., xN are the labels of nodes in Td, i.e. features, then the nodes of G can be divided into disjoint sets, A1, ..., AN, such that the nodes in Ak are:

\[ A_k = \{ x_k(1), x_k(2), \ldots, x_k(m_k) \} \]

i.e. xk(j), j = 1, 2, ..., mk are the states of feature xk. There is a directed edge from xk(j) to xl(m) in G if there is an arc from node xk to xl in Td. The arc in G has a 'cost' given by

\[ c(x_k(j), x_l(m)) = \log p(x_l(m)/x_k(j), w_i). \]

Thus, the arc costs in G are the logs of the conditional probabilities in the product form for p(X/wi). The measurement graph corresponding to Fig. 2.3, assuming binary features, is shown below.

Fig. 2.4 Measurement Graph

Definition: A terminal node of graph G is one having no outgoing edges from it, e.g. in Fig. 2.4, x3(1), x3(2), x4(1), x4(2), x5(1), x5(2).

Definition: The weight of a spanning tree, Ts, is the sum of the arc costs in Ts.

Definition: A minimal/maximal spanning tree is one whose weight is minimum/maximum among all spanning trees of graph G(Td,wi).

Definition: A constrained spanning tree, Ts(X1), of G is a spanning tree of G containing all the nodes (feature states) in X1. Two constrained trees, Ts(x1(2), x4(1), x5(2)), are shown below.

Definition: A minimal/maximal constrained spanning tree is a constrained tree having minimum/maximum weight among all such trees.

If d(T) denotes the weight of tree, T, then the minimal and maximal weights of Ts(X1), denoted by li(X1), ui(X1), are

\[ l_i(X_1) = \min_{\forall\, T_s(X_1)} d(T_s(X_1)) \qquad (2.7a) \]

\[ u_i(X_1) = \max_{\forall\, T_s(X_1)} d(T_s(X_1)) \qquad (2.7b) \]

Since the arc weights are logs of the conditional probabilities, li, ui are the log functions of the minimum/maximum values of p(X/wi) given that measurements, X1, have been taken.

\[ l_i(X_1) = \min_{\bar{X}_1} \left[ \log p(X_1, \bar{X}_1/w_i) \right] \]

\[ u_i(X_1) = \max_{\bar{X}_1} \left[ \log p(X_1, \bar{X}_1/w_i) \right] \]

In the above equations $\bar{X}_1$ is the measurement set not yet observed. Substituting for p(X/wi) in eqn. 2.4, the bounds become:

\[ l(w_i/X_1) = 1 - \frac{P(w_i) \exp(u_i(X_1))}{P(w_i) \exp(u_i(X_1)) + \sum_{j \ne i} P(w_j) \exp(l_j(X_1))} \qquad (2.8a) \]

\[ u(w_i/X_1) = 1 - \frac{P(w_i) \exp(l_i(X_1))}{P(w_i) \exp(l_i(X_1)) + \sum_{j \ne i} P(w_j) \exp(u_j(X_1))} \qquad (2.8b) \]

The above bounds are loose because it has been assumed that the measurements $\bar{X}_1$ which minimize/maximize p(X/wi) simultaneously maximize/minimize p(X/wj) for all j ≠ i. Tight bounds can be obtained with no more computation than that required for equations (2.8a), (2.8b), if it is assumed that the unconditional probability of X, p(X), can also be written as a product of first order conditional probabilities, as

\[ p(X) = p(x_1) \prod_{j=2}^{N} p(x_j/x_{t(j)}) \quad \text{where } t(j) < j. \]

In this case, we define an augmented measurement graph as follows:

Definition: An augmented measurement graph, G'(Td,wi), for class wi and dependence tree, Td, is identical to the graph G(Td,wi), except that the arc cost, c(xk(j),xl(m)), is

\[ c(x_k(j), x_l(m)) = \log p(x_l(m)/x_k(j), w_i) - \log p(x_l(m)/x_k(j)). \]

The second term is the log of the probability of xl(m) given xk(j), but not conditioned on wi.

One can then define the min/max weight of a constrained spanning tree of the augmented graph, G'(Td,wi), for a measurement set, X1, as li'(X1), ui'(X1) respectively, as before. The bounds on the risk of decision wi, given X1, become

\[ l'(w_i/X_1) = 1 - \max_{\bar{X}_1} \frac{P(w_i)\, p(X_1, \bar{X}_1/w_i)}{p(X_1, \bar{X}_1)}, \]

i.e.

\[ l'(w_i/X_1) = 1 - P(w_i) \exp(u_i'(X_1)). \qquad (2.9a) \]

Similarly,

\[ u'(w_i/X_1) = 1 - P(w_i) \exp(l_i'(X_1)). \qquad (2.9b) \]

To use the bounds defined by equations (2.8a-b), (2.9a-b), one needs an efficient method of computing the minimal/maximal spanning trees for a graph. One such method, which is a modification of the dynamic programming formulation for finding the shortest path through a graph, is presented next.


2.4.2.1. Computation Of Minimal/Maximal Spanning Tree

Assume that one wants to compute li(X1), ui(X1), the minimal/maximal spanning tree weights, for a measurement graph, G(Td,wi). Denote by Ak the set of nodes of G labelled xk(1), xk(2), ..., xk(mk), where mk is the number of possible states of feature xk. Let DT(xk) be the descendant nodes of node xk in the dependence tree Td, and DG(xk(j)) the descendant nodes (sons) of node xk(j) in G. The following algorithm recursively computes the longest/shortest path length from a node to all terminal nodes in G. For the case when the measurement graph, G, consists of N features, each taking on one of m states, this algorithm requires computations of the order of 2(N-1)m^2.

Step 1: Prune G, as follows: If xk(j) ∈ X1, the observed set of features, delete all nodes (and attached edges) in Ak from G, except xk(j). In this way, for each observed feature, all states except the observed one are deleted.

Step 2: Compute the functions li(xk(s)), ui(xk(s)), for each node xk(s) in the pruned graph using the recursive equations:

\[ l_i(x_k(s)) = \sum_{x_j \in D_T(x_k)} \left[ \min_{x_j(t) \in D_G(x_k(s))} \{ c(x_k(s), x_j(t)) + l_i(x_j(t)) \} \right] \]

\[ u_i(x_k(s)) = \sum_{x_j \in D_T(x_k)} \left[ \max_{x_j(t) \in D_G(x_k(s))} \{ c(x_k(s), x_j(t)) + u_i(x_j(t)) \} \right] \]

where c(xk(s), xj(t)) denotes the arc cost along the edge joining nodes xk(s), xj(t) in G, and for terminal nodes of G,

\[ l_i(x_j(t)) = u_i(x_j(t)) = 0 \quad \text{for } x_j(t) \text{ terminal.} \]

If x1 is the root node of the dependence tree, Td, then the lower/upper bounds, li(X1), ui(X1), are

\[ l_i(X_1) = \min_{t}\, l_i(x_1(t)) \]

\[ u_i(X_1) = \max_{t}\, u_i(x_1(t)) \]

The correctness of the above algorithm follows from the additivity of arc costs along a path and from the optimality principle of dynamic programming.
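The two steps above can be sketched in Python as a memoized recursion over the dependence tree. This is an illustrative sketch rather than the author's program: the function signature, the `cost` callback and the made-up conditional probabilities are assumptions, states are indexed from 0, and, as in the recursion stated above, no separate term for the root feature is added.

```python
import math
from functools import lru_cache


def spanning_tree_bounds(children, n_states, cost, observed, root):
    """Return (minimal, maximal) constrained spanning tree weight.

    children : dict feature -> tuple of its sons in the dependence tree
    n_states : dict feature -> number of states of that feature
    cost     : cost(k, s, j, t) = arc cost from state s of x_k to state t of x_j
    observed : dict feature -> observed state (the measurement set X1)
    """
    def states(k):
        # Step 1 (pruning): an observed feature keeps only its observed state.
        return (observed[k],) if k in observed else tuple(range(n_states[k]))

    @lru_cache(maxsize=None)
    def bounds(k, s):
        # Step 2: sum over sons of x_k, best state of each son chosen separately.
        lo = hi = 0.0
        for j in children.get(k, ()):
            lo += min(cost(k, s, j, t) + bounds(j, t)[0] for t in states(j))
            hi += max(cost(k, s, j, t) + bounds(j, t)[1] for t in states(j))
        return lo, hi

    lo = min(bounds(root, t)[0] for t in states(root))
    hi = max(bounds(root, t)[1] for t in states(root))
    return lo, hi


# Dependence tree of Fig. 2.3 with binary features; probabilities are made up.
CHILDREN = {1: (2, 3), 2: (4, 5)}
N_STATES = {k: 2 for k in range(1, 6)}


def example_cost(k, s, j, t):
    """log10 of a hypothetical p(x_j = t / x_k = s, w_i)."""
    p_same = 0.5 + 0.08 * j
    return math.log10(p_same if s == t else 1.0 - p_same)


lo, hi = spanning_tree_bounds(CHILDREN, N_STATES, example_cost, {4: 1}, root=1)
```

Memoizing on the pair (feature, state) is what keeps the work proportional to the number of arcs in G, since each node's pair of bounds is computed once and reused by every state of its parent.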

Example:

The following example illustrates the use of the algorithm for finding the minimal/maximal spanning trees of a measurement graph. The example considered is that of Fig. 2.3, assuming that each feature is binary. The first-order conditional probabilities, conditioned on class w1, and the corresponding measurement graph, G(Td,w1), are shown in Fig. 2.7a and 2.7b. Fig. 2.7c shows the minimal and maximal paths traced by the algorithm. The dotted edges indicate the maximal tree, the solid edges the minimal tree. Arc costs are log (base 10) functions of the probabilities. Next to each node in Fig. 2.7c, the pair of numbers '(a,b)' are the minimum/maximum costs, viz. li(xk(s)) and ui(xk(s)), defined in Step 2 of the algorithm. For terminal nodes of the measurement graph, these numbers are both zero. The algorithm backs up from the terminals and computes (a,b) for higher level nodes using the recursive equations described in Step 2. The values of the unconstrained minimal and maximal tree weights are -2.962 and -.927 respectively. These correspond to the measurement vectors (x1=0, x2=1, x3=1, x4=2, x5=1) and (11101), which give rise to the min/max values of p(X/w1), viz. 0.00108 and 0.118 respectively.


2.4.3. Nearest Neighbour Classification

A nearest neighbour (NN) classification scheme is a nonparametric method, i.e. it makes no assumptions regarding the form of the class conditional distributions of features. It uses instead a set of labelled design samples and a distance measure between two samples to classify a test sample, X. The sample X is labelled with the class name of its nearest neighbour in the design set.

The nearest neighbour can always be found by computing all distances. However, when the design set size is large, the computations become large, and a variety of schemes have been proposed to overcome this difficulty. They consist of representing the design set by a small number of prototypes (condensed NN rule, [30]), and preprocessing of these prototypes [6,31], so that the nearest neighbour can be found with a small average number of distance calculations.

In this section, a state-space model of nearest neighbour classification is presented, and it is shown that methods such as in [6] are special cases of B-admissible search strategies with zero measurement (arc) cost assumed in the graph. The state-space model also has the advantage of being able to accommodate extensions of these schemes to other NN schemes, notably the following:

. a new method called 'subspace nearest neighbour rule' which allows one to determine the nearest neighbour of X in a subspace of the total measurement space, and to trade accuracy for reduced measurement cost;

. an extension of the concept of S- and B-admissible strategies to schemes using non-metric features and a similarity measure between samples to label a test sample. The method of using prototypes to represent the design set in such cases and methods of computing lower/upper bounds on the similarity measure are discussed.

2.4.3.1. The Model

A state space model for non-parametric schemes may be defined as a 7-tuple, G(S, E, >, U, c, r, T), as before, but with the following changes in the connotation of certain symbols:

A state s represents a subset, Ds, of samples from the total design set, D, and is denoted by a 3-tuple, (xs, Ks, Is), where,

xs: is the measurement to be taken on the test sample when this state is reached,

Ks: the classes to which the samples in Ds belong. This set is known since the samples are labelled.

Is: denotes information about the sample set Ds, e.g. its mean vector, measures of its dispersion, etc.

The same sample can occur in two different sets, Dsi, Dsj, where states si, sj may not be related (one may not be a descendant of the other). However, one restriction on the sets is that if state t is a descendant of state s, then Dt ⊂ Ds must hold.

The real valued function c on the edges E of the state space graph represents the cost of traversing an arc and could represent either the measurement cost, or the cost of computing the distance measure between the test sample and the samples in that set, Ds. The function r, used in earlier discussions to denote the misclassification risk, is used here to represent the distance between the test sample, X, and a particular design sample, Ys*. A goal state s* is one for which |Ds*| = 1, i.e. it represents a single sample, say Ys*. Hence,

\[ r(s^*) = d(X, Y_{s^*}; F), \]

where,
d is the distance measure used,
F is the space of features in which d is computed,
X, Ys* are the test and design samples respectively.

Let,

Fs be the features measured on the path to state s,
c(Fs) be the sum of arc costs on the path to s.

Then the following types of nearest neighbour schemes can be formulated:

Scheme 1: Find the goal s*, i.e. the design sample Ys*, such that

\[ f(s^*) = c(F_{s^*}) + d(X, Y_{s^*}; F_{s^*}) = \min_{s} \left[ c(F_s) + d(X, Y_s; F_s) \right] \]

where s is any goal node.

Scheme 2: Find the goal s*, such that

\[ f(s^*) = c(F_{s^*}) + d(X, Y_{s^*}; F) = \min_{s} \left[ c(F_s) + d(X, Y_s; F) \right] \]

where s is any goal node, and F is the total set of features.

Scheme 3: Find the goal s*, such that

\[ f(s^*) = d(X, Y_{s^*}; F) = \min_{s} \left[ d(X, Y_s; F) \right] \]

where s is any goal state.
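The three objective functions can be compared on a toy design set by exhaustive evaluation. The sketch below is only meant to show how the objectives differ, not how an admissible search would find the minimum; the data, the per-feature costs, the Euclidean choice of d and all names are assumptions introduced for illustration.

```python
import math


def dist(x, y, feats):
    """Euclidean distance between x and y restricted to the feature indices feats."""
    return math.sqrt(sum((x[t] - y[t]) ** 2 for t in feats))


def nn_schemes(x, goals, feature_cost):
    """Return the labels chosen by Schemes 1, 2 and 3 by brute force.

    goals        : list of (label, design sample Ys, features Fs on the path to s)
    feature_cost : cost of measuring each feature; c(Fs) is the sum over Fs
    """
    all_feats = range(len(x))

    def c(feats):
        return sum(feature_cost[t] for t in feats)

    s1 = min(goals, key=lambda g: c(g[2]) + dist(x, g[1], g[2]))       # Scheme 1
    s2 = min(goals, key=lambda g: c(g[2]) + dist(x, g[1], all_feats))  # Scheme 2
    s3 = min(goals, key=lambda g: dist(x, g[1], all_feats))            # Scheme 3
    return s1[0], s2[0], s3[0]


# Made-up test sample, design samples and path feature sets.
X_TEST = (0.0, 0.0, 0.0)
GOALS = [
    ("A", (0.1, 0.0, 5.0), (0,)),       # close on its path feature, far overall
    ("B", (1.5, 1.5, 0.0), (0,)),       # cheap path, moderately close overall
    ("C", (1.0, 1.0, 1.0), (0, 1, 2)),  # true nearest neighbour, costly path
]
choices = nn_schemes(X_TEST, GOALS, feature_cost=[1.0, 1.0, 1.0])
```

In this made-up case each scheme picks a different design sample, which illustrates that only Scheme 3 is guaranteed to return the true nearest neighbour.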

Schemes (1) and (2) assume that measurement cost (arc cost) is
significant, and hence attempt to trade this cost for reduced
accuracy; it is likely that they may not terminate with the
truly nearest neighbour. Scheme (3) is the conventional NN
scheme, assuming that arc costs do not contribute to the
solution cost. Schemes (1) - (3) can be implemented as
S-admissible, B-admissible, and Bayes-admissible strategies
respectively.

Scheme (1) assumes that the design sample, Ys*, is adequately
represented by the features, Fs*, measured on the path to it,
and that other features are not significant in the distance
measure, i.e. d(X,Ys*;F) can be approximated by d(X,Ys*;Fs*).
Hence the space in which the d measure is computed depends upon
the design sample and its class label. Such an assumption may be
justified, for example, in a character recognition scheme, where
features such as the 'horizontal top line' or 'curly tail' may
be significant for some class, whereas the slant of the axis of
symmetry of the letter may vary with writing styles, and be
considered unimportant for distinguishing that class.

Scheme (2) does not assume that any features are unimportant
for the d measure, but it does associate an arc cost with every
edge and hence trades the risk of not finding the true nearest
neighbour for reduced feature measurement cost.

The crucial problem in using the admissible search strategies
discussed earlier is that of computing lower/upper bounds on
the d() measure. The following sections derive bounds for
various metric and non-metric similarity measures.

2.4.3.2. Euclidean Measure

Assume that d(X,Y;F) is the Euclidean distance between X, Y in
the space of features, F. Ds is the set of samples represented
by any state s, and Is, the information associated with s. We
shall assume that Is consists of the following:

Ms : Mean of samples in Ds, in the total space, F.
Fs : features measured on the path to node s.
Rsmax(F') : distance between Ms and the sample Y in Ds
            farthest from Ms, in the space F'.
Rsmin(F') : distance of the sample closest to Ms in F'.

From the nature of the d measure,

    $d(X, Y_s; F) \ge d(X, Y_s; F')$ if $F' \subseteq F$.

Lemma:

    $\min_{Y \in D_s} d(X,Y;F) \ge d(X,M_s;F') - R_{s\max}(F')$

Hence the right-hand side of the above inequality provides a
lower bound on the nearest neighbour's distance from X, from
among the samples in set Ds.

Proof: For any sample $Y \in D_s$,

    $d(X,Y;F') + d(Y,M_s;F') \ge d(X,M_s;F')$

from the triangular inequality property of d. Hence,

    $d(X,Y;F) \ge d(X,Y;F') \ge d(X,M_s;F') - d(Y,M_s;F') \ge d(X,M_s;F') - R_{s\max}(F')$

Hence,

    $\min_{Y \in D_s} d(X,Y;F) \ge d(X,M_s;F') - R_{s\max}(F')$

If at node s, the information stored is Ms and Rsmax(Fs), then
the lower bound derived above can be used in an S-admissible
strategy for Scheme (1) defined earlier. The lemma is a
generalization of the bound proposed by Fukunaga [6] for a
branch-and-bound algorithm for determining nearest neighbours.
Scheme (1) has the advantage that the radius, Rsmax, has to be
computed in a single-dimensioned feature space for all level 1
nodes in the state-space graph, in two dimensions for level 2
nodes, etc. Thus, at the initial node, where Ds is the total
design set, the computation of Rsmax is the easiest, and it
successively becomes more complex as the number of samples
decreases. In the scheme described in [6], the radius Rsmax is
always computed in the total feature space, F, and hence far
more computations are involved. Of course, scheme (1) is valid
only where a subspace representation of each design sample is
justified.

Lemma:

    $\min_{Y \in D_s} d(X,Y;F) \le d(X,M_s;F) + R_{s\min}(F)$

Hence the right-hand side is an upper bound on the distance of
the nearest neighbour in Ds from X.

Proof: Let Y' be the point nearest to Ms in F. Then, by the
triangular inequality,

    $d(X,Y';F) \le d(X,M_s;F) + R_{s\min}(F)$

Hence,

    $\min_{Y \in D_s} d(X,Y;F) \le d(X,M_s;F) + R_{s\min}(F)$
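The two lemmas can be checked numerically. The following is a minimal sketch, assuming a small randomly generated design set and test sample (both hypothetical, not data from this work). The helper names `dist` and `nn_distance_bounds` are illustrative only. The lower bound is computed in a subspace F', the upper bound in the total space F.

```python
import math
import random

def dist(a, b, features):
    """Euclidean distance between a and b restricted to the given feature indices."""
    return math.sqrt(sum((a[i] - b[i]) ** 2 for i in features))

def nn_distance_bounds(x, samples, subspace):
    """Return (lower, upper, true_min) for the nearest-neighbour distance from x.

    lower = d(X,Ms;F') - Rsmax(F'), using only the features in `subspace` (F').
    upper = d(X,Ms;F)  + Rsmin(F),  using the total feature space F.
    """
    full = range(len(x))
    mean = [sum(s[i] for s in samples) / len(samples) for i in full]  # Ms
    r_max_sub = max(dist(s, mean, subspace) for s in samples)          # Rsmax(F')
    r_min_full = min(dist(s, mean, full) for s in samples)             # Rsmin(F)
    lower = dist(x, mean, subspace) - r_max_sub
    upper = dist(x, mean, full) + r_min_full
    true_min = min(dist(x, s, full) for s in samples)
    return lower, upper, true_min

rng = random.Random(0)
design = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(50)]  # hypothetical Ds
test_x = [rng.gauss(3.0, 1.0) for _ in range(4)]                       # hypothetical X
lower, upper, true_min = nn_distance_bounds(test_x, design, subspace=[0, 1])
print(lower, true_min, upper)
```

The lower bound may be negative when X lies inside the sphere of radius Rsmax around Ms, in which case it carries no pruning information for that node.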

The use of this upper bound involves computing d in the total
feature space, F, and all features of X must be measured. Thus
it is not suitable for Scheme (2), where we would like to
compute the upper bound without incurring this measurement
cost. However, the bound can be used in scheme (3), where
measurement cost is assumed to be zero. Without some other
knowledge about the range of feature values, it is not possible
to upper bound the Euclidean distance for use in scheme (2).

2.4.3.3. Ultrametric Measure

An ultrametric satisfies the property,

    $d(X,Y) \le \max [\, d(X,Z),\ d(Z,Y) \,]$ for any X, Y, Z,

and hence also satisfies the triangular inequality. Hence the
same lower bound as defined for the Euclidean measure can be
used, and if Ysp is any prototype sample in Ds, the distance
d(X,Ysp) can be used as an upper bound on the minimum distance
between X and any sample in Ds.

2.4.4. Bounds on Similarity Measures For Binary Vectors

Nearest neighbour schemes can be used for samples with
non-metric features if appropriate distance or similarity
measures are defined between samples. If a similarity criterion
is used, a sample X is classified in the class to which the
(design) sample most similar to it belongs. This section
considers the possibility of representing subsets of the design
samples, say Ds, at each state s in a state-space graph, by
'prototypes'. The prototypes at s are said to 'cover' the set
Ds if, given the prototypes, one can generate a set D' such
that $D_s \subseteq D'$. Thus the prototypes play a role similar
to the mean Ms and radii, Rsmax, Rsmin, discussed in the nearest
neighbour schemes using the Euclidean distance. Once such
prototypes are generated, one would like to bound the value of
the similarity measure, say S(X,Y), between X and any
$Y \in D_s$, by using the information in the prototypes.

Consider the case when the samples are N-bit binary vectors. A
prototype vector, t, will be assumed to be an N-bit vector whose
bits can have values '0', '1', or '-' (don't care). The set
D(t) is the set obtained by replacing the '-' bits in t by all
possible combinations of '0's and '1's. Thus, if t=(-0-10), then
D(t) = {00010, 00110, 10010, 10110}. Methods of generating good
prototypes for the design samples from each class, for the
discrete feature case, are discussed in [13,15].
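The definition of D(t) can be illustrated with a short sketch. The function name `expand_prototype` and the string encoding of prototypes are assumptions made for illustration. It simply enumerates every way of filling the '-' bits.

```python
from itertools import product

def expand_prototype(t):
    """Return D(t): every binary string obtained by filling the '-' bits of t."""
    dash_positions = [i for i, bit in enumerate(t) if bit == "-"]
    members = []
    for fill in product("01", repeat=len(dash_positions)):
        bits = list(t)
        for pos, value in zip(dash_positions, fill):
            bits[pos] = value
        members.append("".join(bits))
    return members

print(expand_prototype("-0-10"))
```

A prototype with u don't-care bits covers 2^u vectors, so enumeration is only practical for small u. This is why bounds computed directly from the prototype are of interest.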

The model for NN search using S(X,Y), a similarity measure
between binary vectors, is identical to that for the nearest
neighbour schemes discussed earlier. The task here is to
maximize the value of S(X,Y). The information Is at node s is
assumed to be a set of prototypes, {ts(i)}, i=1,2,...,ps, such
that,

    $D_s \subseteq D(t_s(1)) \cup D(t_s(2)) \cup \dots \cup D(t_s(p_s))$

Define

    $S_l(X,t) = \min_{Y \in D(t)} S(X,Y)$ and $S_u(X,t) = \max_{Y \in D(t)} S(X,Y)$

Then, it follows that,

    $\min_{Y \in D_s} S(X,Y) \ge \min_{1 \le i \le p_s} [\, S_l(X, t_s(i)) \,]$

    $\max_{Y \in D_s} S(X,Y) \le \max_{1 \le i \le p_s} [\, S_u(X, t_s(i)) \,]$

The bounds, Sl, Su, for various types of measures are derived
in the next section.

2.4.4.1. Bounds, Sl, and Su

The following notation is used in the discussion in this
section.

X, Y : test sample and a design sample respectively.
       Both are binary vectors.
N : the number of bits in X, Y.
S(X,Y) : similarity criterion used.
t : a prototype vector whose bits can be 0, 1 or '-'.
u(t) : the number of '-' bits in t.
n1(t) : the number of '1' bits in t.
n1(X) : the number of '1' bits in any binary vector, X.
m(X,Y) : the number of matching bits in vectors X, Y.
m(X,t) : number of matching bits in X and prototype t,
         where only matches in positions at which t has 0/1
         bits are considered.
n1(X,-) : number of '1's in X in positions where t has
          a '-' bit.

The following lemmas derive bounds Sl, Su in terms of the above
defined quantities for three types of similarity measures
commonly used [1].

Lemma: Let $S(X,Y) = m(X,Y) / N$.

Then,

    $S_l(X,t) = \min_{Y \in D(t)} S(X,Y) = m(X,t) / N$

and

    $S_u(X,t) = \max_{Y \in D(t)} S(X,Y) = (m(X,t) + u(t)) / N$

Proof: The proof follows from the fact that S can be minimized
(maximized) by choosing the '-' bits in t to be opposite (the
same as) the corresponding bits in X.
Lemma: Let

    $S(X,Y) = \frac{N \, m(X,Y)}{n_1(X) \, n_1(Y)}$

Then,

    $S_l(X,t) = \min\left\{ \frac{N\,(m(X,t) + n_1(X,-))}{n_1(X)\,(n_1(t) + u(t))},\ \frac{N\,m(X,t)}{n_1(X)\,(n_1(t) + u(t) - n_1(X,-))} \right\}$

    $S_u(X,t) = \max\left\{ \frac{N\,(m(X,t) + u(t) - n_1(X,-))}{n_1(X)\,n_1(t)},\ \frac{N\,(m(X,t) + u(t))}{n_1(X)\,(n_1(t) + n_1(X,-))} \right\}$


Proof: Let Y be any vector in D(t), and let,

    x = number of '1's in Y in the '-' positions of t,
    y = number of matching bits between X, Y in the '-'
        positions of t.

Then for this vector Y,

    $S(X,Y) = \frac{N \, m(X,Y)}{n_1(X) \, n_1(Y)} = \frac{N\,[\, m(X,t) + y \,]}{n_1(X)\,[\, n_1(t) + x \,]}$   (2.10)

Equation (2.10) shows that for a given value of x, i.e. the
number of '1's in Y at the '-' positions in t, y can be
minimized, and hence S, by assigning as many of these ones as
possible to positions at which X has zeroes, thus causing
maximum mismatch between X, Y. In particular, y=0 if
x = u(t) - n1(X,-) and the ones in Y are assigned in this
manner. Decreasing x below this value would cause y to
increase; hence the minimum value of S must be attained in the
x range,

    $u(t) - n_1(X,-) \le x \le u(t)$

and

    $y = x - [\, u(t) - n_1(X,-) \,]$   (2.11)

Substituting for y in (2.10) using (2.11) and minimizing over
the specified range of x gives the desired result for Sl.


For maximizing S(X,Y), it is noted that for a given number of
matches y, x will be minimized, and S maximized, by making the
matches occur at the '0's in X, as much as possible. For
y = u(t) - n1(X,-), x will be zero if Y has all '0's in the '-'
positions in t. If y is decreased below this value, x can only
increase. Hence, the range of y is,

    $u(t) - n_1(X,-) \le y \le u(t)$

and,

    $x = y - [\, u(t) - n_1(X,-) \,]$   (2.12)

Substituting for x in (2.10) using (2.12) and maximizing over
the above defined y range gives the desired value of Su.


Example 1:

Consider a sample, X, and prototype, t, given by,

    X = ( 1011010010 )
    t = ( 01010 ), where only the five specified bits are shown;
        t also has five '-' bits.

Here N=10, u(t)=5, n1(X)=5, n1(t)=2, m(X,t)=3, n1(X,-)=3.

    $\min_{Y \in D(t)} S(X,Y) = \min\{\, 10(3+3)/(5(2+5)),\ 10 \cdot 3/(5(2+5-3)) \,\} = 1.5$

    $\max_{Y \in D(t)} S(X,Y) = \max\{\, 10(3+5-3)/(5 \cdot 2),\ 10(3+5)/(5(2+3)) \,\} = 5.0$
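The bounds in Example 1 can be cross-checked by enumerating D(t). The sketch below assumes a particular placement of the '-' bits, `T = "-0101-0---"`, chosen only so that the counts quoted in the example (u(t)=5, n1(t)=2, m(X,t)=3, n1(X,-)=3) are reproduced; the actual positions used in the example are not known. All function names are illustrative.

```python
from fractions import Fraction
from itertools import product

X = "1011010010"
T = "-0101-0---"   # hypothetical placement of the '-' bits, consistent with the counts

def members(t):
    """Yield every vector in D(t)."""
    dash = [i for i, bit in enumerate(t) if bit == "-"]
    for fill in product("01", repeat=len(dash)):
        bits = list(t)
        for pos, value in zip(dash, fill):
            bits[pos] = value
        yield "".join(bits)

def similarity(x, y):
    """S(X,Y) = N.m(X,Y) / (n1(X).n1(Y)), with m counting all matching bits."""
    m = sum(a == b for a, b in zip(x, y))
    return Fraction(len(x) * m, x.count("1") * y.count("1"))

def closed_form_bounds(x, t):
    """Sl and Su from the lemma, using only counts taken from x and t."""
    n = len(x)
    dash = [i for i, bit in enumerate(t) if bit == "-"]
    u = len(dash)
    n1_x, n1_t = x.count("1"), t.count("1")
    n1_dash = sum(x[i] == "1" for i in dash)                    # n1(X,-)
    m_xt = sum(x[i] == t[i] for i in range(n) if t[i] != "-")   # m(X,t)
    s_l = min(Fraction(n * (m_xt + n1_dash), n1_x * (n1_t + u)),
              Fraction(n * m_xt, n1_x * (n1_t + u - n1_dash)))
    s_u = max(Fraction(n * (m_xt + u - n1_dash), n1_x * n1_t),
              Fraction(n * (m_xt + u), n1_x * (n1_t + n1_dash)))
    return s_l, s_u

values = [similarity(X, y) for y in members(T)]
print(closed_form_bounds(X, T), min(values), max(values))
```

Exact rational arithmetic is used so that the closed-form values and the enumerated extremes can be compared without rounding error. The closed form assumes n1(t) > 0; with n1(t) = 0 the first term of Su is undefined.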


Lemma: Let

    $S(X,Y) = \frac{m(X,Y)}{n_1(X) + n_1(Y) - m(X,Y)}$

Then,

    $S_l(X,t) = \frac{m(X,t)}{n_1(X) + n_1(t) - m(X,t) - n_1(X,-) + u(t)}$

    $S_u(X,t) = \frac{m(X,t) + u(t)}{n_1(X) + n_1(t) - m(X,t) + n_1(X,-) - u(t)}$

Proof: Defining x, y as before, S(X,Y) becomes,

    $S(X,Y) = \frac{m(X,t) + y}{n_1(X) + n_1(t) - m(X,t) + x - y}$   (2.13)

From (2.13), it is observed that S will be minimized for a
given x by minimizing y, and this can be done by assigning the
'1's in Y to positions where X has '0's. A similar argument can
be applied for maximizing S. The ranges for x, y and their
relationship are identical to (2.11) and (2.12) derived in the
previous proof. By minimizing/maximizing S over these ranges,
one gets the desired result.


Example 2: Using the same X and t as in Example 1,

    X = ( 1011010010 )
    t = ( 01010 ), with five further '-' bits as before.

    $\min_{Y \in D(t)} S(X,Y) = 3 / (5+2-3-3+5) = 0.5$

    $\max_{Y \in D(t)} S(X,Y) = (3+5) / (5+2-3+3-5) = 4.0$
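The same cross-check can be applied to Example 2. As before, this is a sketch under an assumed placement of the '-' bits (`T = "-0101-0---"`, hypothetical, chosen only to match the counts stated in Example 1), and the helper names are illustrative.

```python
from fractions import Fraction
from itertools import product

X = "1011010010"
T = "-0101-0---"   # hypothetical: positions of the '-' bits are assumed

def third_measure(x, y):
    """S(X,Y) = m(X,Y) / (n1(X) + n1(Y) - m(X,Y)), with m counting all matching bits."""
    m = sum(a == b for a, b in zip(x, y))
    return Fraction(m, x.count("1") + y.count("1") - m)

def bounds_third_measure(x, t):
    """Closed-form Sl, Su of the lemma, from the counts defined in the notation."""
    dash = [i for i, bit in enumerate(t) if bit == "-"]
    u = len(dash)
    n1_dash = sum(x[i] == "1" for i in dash)                         # n1(X,-)
    m_xt = sum(x[i] == t[i] for i in range(len(t)) if t[i] != "-")   # m(X,t)
    base = x.count("1") + t.count("1") - m_xt
    return (Fraction(m_xt, base - n1_dash + u),
            Fraction(m_xt + u, base + n1_dash - u))

def members(t):
    """Yield every vector in D(t)."""
    dash = [i for i, bit in enumerate(t) if bit == "-"]
    for fill in product("01", repeat=len(dash)):
        bits = list(t)
        for pos, value in zip(dash, fill):
            bits[pos] = value
        yield "".join(bits)

values = [third_measure(X, y) for y in members(T)]
print(bounds_third_measure(X, T), min(values), max(values))
```

Because m(X,Y) counts matching zeroes as well as matching ones, this measure is not confined to the interval [0,1], which is why the maximum in Example 2 can reach 4.0.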

The above derived bounds show that for many nontrivial
parametric and nonparametric schemes, one can obtain bounds
relatively inexpensively, and thereby cut down the measurement
cost or the number of distance computations (for nonparametric
schemes) needed to classify a test sample. The generality of
the model has allowed us to describe new types of
nearest-neighbour schemes which use distance measures in
subspaces of the total feature space to find the design sample
'most similar' to the test sample. The next section describes a
particular type of state-space graph, viz. the hierarchical
classifier, as a useful model that restricts the generality of
the state-space model and thereby improves the search
efficiency for practical multiclass recognition problems.

2.5. Hierarchical Classifiers

A state  space  model  is  a very  general  representation  of 
any  multistage  scheme  in  that  one  could  have  a model,  where 
at  any  node  tk,  which  is  k nodes  away  from  the  initial  node, 
one  could  observe  any  of  the  N-k  features  not  observed 
(assuming  N features  are  available,  and  at  each  node,  one 
observes  one  feature);  alternatively,  one  could  classify  the 
sample  into  any  of  the  M classes.  The  figure  below  shows  the 
possible  successor  nodes  of  tk. 


In the above diagram, each state which is not a goal state is
shown as a tuple, (fi,Wi). Since fi can be any of the N-k
features that have not been observed so far, and Wi is any
subset of the M classes, the number of such non-goal states is
$(N-k) \cdot 2^M$. In addition, one has the M possible goal
states; hence the total number of successors of a node k levels
deep in the graph is:

    $M + (N-k) \cdot 2^M$
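To give a sense of how quickly this count grows, the expression can be evaluated directly. This is a minimal sketch; the function name `successor_count` and the example values of N, M and k are illustrative assumptions.

```python
def successor_count(n_features, n_classes, depth):
    """Successors of a node `depth` levels deep: M goal states plus (N-k).2^M non-goal states."""
    non_goal = (n_features - depth) * 2 ** n_classes
    return n_classes + non_goal

# Hypothetical problem sizes, for illustration only.
for m in (2, 5, 10):
    print(m, successor_count(n_features=10, n_classes=m, depth=2))
```

The exponential dependence on the number of classes M is what makes the unrestricted state-space graph impractical to search.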

While one could still define admissible search strategies for
such a general graph, the search efficiency, defined as the
average number of nodes expanded (measurements taken) to
classify a sample, would be poor. In a practical design,
therefore, one might want to use prior information, such as the
usefulness of certain features in discriminating subsets of
classes, to limit the graph. The graph could be restricted
either by restricting the possible successor nodes of a node,
or by limiting the states in some way.

A hierarchical classifier is a particular kind of model,
wherein the states and possible transitions (edges) in the
graph are explicitly defined. The restriction put on the states
is that if (fk,Wk) denotes state k, and if state t is a
successor of state k, then,

    $W_t \subset W_k$

This condition implies that the set of possible class labels
considered when t is reached is a proper subset of those
considered possible at any of its ancestor nodes. Hence, the
decision process can be represented as a tree, at each node of
which the set of possible class labels is partitioned into
subsets (perhaps overlapping) of labels. Thus along any path in
the tree, at each node, at least one class is being 'rejected'
from consideration. The term 'rejected' is being qualified
because it is quite likely that an admissible strategy might
expand nodes on some path and then back up to expand some node
(in the OPEN list) on some other path. All the properties of
S-admissible and B-admissible strategies can be made use of in
searching this tree of decisions. An example of such a tree is
shown below, and all the states are indicated explicitly.

[Figure: example decision tree with the class-label set of each state indicated explicitly]

The use of lower and upper bounds in formulating search
strategies has been explored in this chapter. However, when the
state-space graph is small and explicitly defined, as in the
above figure, one can obtain a 'perfect' heuristic. In graph
searching problems, a perfect heuristic is one which can
estimate the path length from any node to the goal node
exactly. Thus to follow the shortest path in the graph, one
simply starts at the initial node, computes the heuristic
function for all its successors, and goes to that node whose
heuristic, added to the cost of that arc (between the current
node and it), is a minimum. This concept can be extended to
graphs where the goal has a risk which is a function of random
variables with a known distribution.
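The path-following rule with a perfect heuristic can be sketched as follows. The small graph, its arc costs, and the function names are hypothetical. Goal nodes are assumed to carry zero risk here, which is a simplification of the general case mentioned in the text.

```python
def exact_cost_to_goal(graph, goals):
    """Return h*, the exact cheapest arc cost from any node to a goal (tree or DAG)."""
    cache = {}

    def h(node):
        if node in goals:
            return 0.0
        if node not in cache:
            cache[node] = min(cost + h(child) for child, cost in graph[node])
        return cache[node]

    return h

def follow_perfect_heuristic(graph, goals, start):
    """Walk from start, always taking the successor minimizing arc cost + h*(successor)."""
    h = exact_cost_to_goal(graph, goals)
    path, total, node = [start], 0.0, start
    while node not in goals:
        child, cost = min(graph[node], key=lambda edge: edge[1] + h(edge[0]))
        path.append(child)
        total += cost
        node = child
    return path, total

# Hypothetical explicitly defined tree: node -> list of (successor, arc cost).
graph = {
    "l1": [("l2", 1.0), ("l3", 4.0)],
    "l2": [("w1", 5.0), ("w2", 2.0)],
    "l3": [("w3", 1.0)],
}
goals = {"w1", "w2", "w3"}
path, cost = follow_perfect_heuristic(graph, goals, "l1")
print(path, cost)
```

With an exact heuristic no backtracking is needed: each step commits to a successor, so the number of nodes visited equals the length of the optimal path.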

2.6. Hart's Probabilistic Decision Tree Model

The state-space model analyzed in this chapter is closest in
spirit to the probabilistic decision tree of Hart [18]. Hart
describes a Bayes-admissible search strategy for a decision
tree such as shown in Fig. 2.10 below.


Fig.  2.10 


Each edge in the binary tree denotes a 'state of nature' and
the terminals represent a joint state of nature comprising the
states on the path to that terminal from the root. Thus,
w1={a,c,g}, w2={a,c,h}, etc. Let X denote the total set of
measurements taken on the sample, and Xi, i=1,2,..n, the
measurements taken on the path through nodes 1,2,..n, assumed
to lead to node n (say). Let $\Theta_n$ be the joint state of
nature comprised of $(\theta_1, \theta_2, .., \theta_n)$, which
are the edges along this path. Then the a posteriori
probability that one of the joint states below node n has
occurred, given the measurement vector, X, is,

    $f(n) = p(\Theta_n / X)$.

Hart makes three assumptions regarding X and $\Theta_n$, viz.,

(a) ancestor dependence, i.e. $p(\Theta_n/X)$ depends only on
    the measurements, X1..Xn, on the path to node n. Hence,

    $p(\Theta_n/X) = p(\Theta_n / X_1, X_2, .., X_n)$.

(b) conditional independence, i.e.

    $p(X_1, .., X_n / \Theta_n) = \prod_{i=1}^{n} p(X_i/\theta_i)$.

(c) independence, i.e.

    $p(\Theta_n) = \prod_{i=1}^{n} p(\theta_i)$ and $p(X_1, .., X_n) = \prod_{i=1}^{n} p(X_i)$.

Under these conditions, the function f(n) which is used by Hart
to order the OPEN nodes can be written as a product,

    $f(n) = p(\Theta_n/X) = \prod_{i=1}^{n} p(\theta_i/X_i)$.
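The product form follows from Bayes' rule applied term by term under (b) and (c). The sketch below checks this numerically on a toy model with binary states and binary measurements. The prior and likelihood values, and all function names, are hypothetical and chosen only for illustration.

```python
import math
from itertools import product

# Hypothetical model: one binary state theta_i and one binary measurement X_i per level.
priors = [0.3, 0.6, 0.5]                             # p(theta_i = 1)
likelihood = [(0.2, 0.9), (0.4, 0.7), (0.1, 0.8)]    # p(X_i = 1 / theta_i = 0), p(X_i = 1 / theta_i = 1)

def p_theta(i, th):
    return priors[i] if th else 1.0 - priors[i]

def p_x_given_theta(i, x, th):
    p1 = likelihood[i][th]
    return p1 if x else 1.0 - p1

def joint(thetas, xs):
    """p(Theta_n, X) under independence (c) and conditional independence (b)."""
    return math.prod(p_theta(i, th) * p_x_given_theta(i, x, th)
                     for i, (th, x) in enumerate(zip(thetas, xs)))

def posterior_joint(thetas, xs):
    """p(Theta_n / X) computed by brute-force normalization over all joint states."""
    evidence = sum(joint(alt, xs) for alt in product((0, 1), repeat=len(xs)))
    return joint(thetas, xs) / evidence

def posterior_product(thetas, xs):
    """Product of the per-level posteriors p(theta_i / X_i)."""
    result = 1.0
    for i, (th, x) in enumerate(zip(thetas, xs)):
        evidence_i = sum(p_theta(i, a) * p_x_given_theta(i, x, a) for a in (0, 1))
        result *= p_theta(i, th) * p_x_given_theta(i, x, th) / evidence_i
    return result

print(posterior_joint((1, 0, 1), (1, 1, 0)), posterior_product((1, 0, 1), (1, 1, 0)))
```

The two functions agree for every choice of states and measurements in this model; if the levels were made statistically dependent, the brute-force posterior would in general no longer factorize.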

From the property that f(n) is non-decreasing on any path in
the tree and along the sequence of nodes expanded by his
algorithm, he proves that the algorithm is Bayes-admissible.

In relation to the three types of strategies discussed in this
chapter, it follows that Hart's strategy falls in the category
of an S-admissible strategy. Under the assumptions (a)-(c), the
function f(n) is an upper bound on the a posteriori probability
of any class under n. We have proved the admissibility of such
strategies under more general conditions in Theorem 2. In
particular, we consider the case where measurement cost is an
important factor (Hart assumes zero costs). Moreover, we have
shown that for more general models of feature dependence, such
as the tree dependence, one can derive lower bounds on the risk
for use in an S-admissible strategy.

Hart states that in general cases (i.e. without making
assumptions such as (a)-(c) above), it may be difficult to find
Bayes-admissible strategies that do not expand the entire
search tree. This conjecture has been disproved in our work,
where we have shown that a B-admissible strategy can be defined
for very general situations (viz. where the risk depends on ALL
measurements, not just those on the path to a particular goal).
The feasibility of computing bounds for use in such strategies
has been demonstrated for many parametric and non-parametric
schemes.


3. Properties Of Hierarchical Classifiers

This chapter investigates some of the theoretical properties of
hierarchical classifiers. Various measures of tree performance
are discussed. Emphasis in this chapter is placed on trees
whose node decisions are class-conditionally statistically
independent. When this assumption is invalid, one can estimate
upper bounds on the error made, in terms of the total tree
error for any class and the error of any node classifier.

Two  'negative'  properties  of  hierarchical  classifiers 
make  the  optimal  design  problem  complex  even  in  those  cases 
where  the  independence  assumption  holds.  The  first  property 
states  that  in  such  cases,  optimizing  each  node's
performance  (Bayes  risk)  does  not  minimize  the  total  tree 
risk.  The  second  property  states  that  choosing  at  each  node 
of  a given  tree  structure,  that  feature  which  makes  the  node 
decision  with  the  least  average  error,  is  not  necessarily 
the  optimal  assignment  of  features.  The  conclusion  is  that 
even  for  statistically  independent  node  decisions,  one 
cannot  optimize  the  tree  performance  by  optimizing 

individually,  the  performance  of  each  classifier  used  at  the 
tree  nodes. 

The following notation will be used in subsequent discussions
of trees. Consider the tree shown below in Fig. 3.1.

3.1. Notation

l1, l2, .. are nonterminal node labels; l1 is the root.
w1, w2, .. are terminal node labels called classes and refer to
           the type of classification performed when the
           decision path leads to that node.
M = total number of classes
F = the total set of features
N = the number of features in the set F
fi = the i-th feature name, i ∈ {1,2,...N},
     for example, 'colour'.
mi = number of states of fi (if discrete)
xi = a random variable representing an observation of feature
     fi, e.g. 'red'.
flj = feature (name) used at node lj
mlj = states of feature (if discrete) used at lj.
xlj = random observation of feature used at node lj.
p(x1,x2,..xn/wi)  probability of observing values x1,..xn of
           features f1,f2,..fn, if sample is from wi.
P(wi)      a priori class probability of class wi.
W(lk)      set of terminal class labels below node lk,
           e.g. in Fig. 3.1, W(l2) = {w1,w2,w3,w4,w5}.
For a binary tree,
W0(lk), W1(lk) are the terminals of the left and right subtrees
           below lk, e.g.
           W0(l1) = {w1,w2,w3,w4,w5}, W1(l1) = {w6,w7}.
S(wi)      set of node labels on a path to class wi from the
           root, e.g. S(w4) = {l1,l2,l5}.
Pc(wi/lk)  probability that a correct decision (e.g. to go
           'left' or 'right' in a binary tree) will be made on
           a random sample from class wi which arrives at node
           lk. In general, Pc(wi/lk) is a function of the
           features used along the path to that node and of
           the decision strategy at all previously traversed
           nodes.

A decision tree, T, is composed of a root node l1, a set of
nonterminal nodes, {li}, and terminal nodes labelled with the
class labels; the same class label may occur at more than one
terminal node. The sample to be classified undergoes a sequence
of tests (decision rules) on the path from the root to a
terminal node, at which point it gets classified as belonging
to that category. The choice of the next test to be performed
(next node traversed) depends upon the last test done, or in
general, upon all measurements taken on the sample.

3.2.  Performance  Of  A Decision  Tree 


3.2.1.  Probability  Of  Correct  Recognition 

Consider a tree such as the one in Fig. 3.1. If the features
used at the nodes are statistically independent and if at each
node, the decision is a function of only that particular
feature observation, then the probability of correct
recognition of some class, wi, is the product of the correct
recognition probability of that class at all nodes on the path
leading to that terminal node, i.e.

    $P_c(w_i) = \prod_{l_k} P_c(w_i/l_k)$ where $l_k \in S(w_i)$

Hence, the total tree performance is given by the class
recognition rates, weighted by their a priori probabilities,
i.e.

    $P_c(T) = \sum_i P(w_i) \prod_{l_k} P_c(w_i/l_k)$, $i \in \{1,2,...,M\}$, $l_k \in S(w_i)$   (3.1)

The average correct recognition rate at a node, lk, for all
samples from the set of classes that lie below it, W(lk), is,

    $P_c(l_k) = [\, \sum_i P(w_i) \, P_c(w_i/l_k) \,] / \sum_i P(w_i)$   (3.2)

where $w_i \in W(l_k)$.

Thus, Pc(lk) is a linear function of the class recognition
rates, Pc(wi/lk), whereas Pc(T) involves products of these
rates. The design problem is to find a tree, T', such that,

    $P_c(T') = \max_T P_c(T)$
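Equations (3.1) and (3.2) can be evaluated directly once the paths S(wi) and the node rates Pc(wi/lk) are tabulated. The following sketch uses a hypothetical three-class tree; the priors, rates and function names are illustrative assumptions only.

```python
import math

# Hypothetical three-class tree: l1 separates w1 from {w2, w3}; l2 separates w2 from w3.
priors = {"w1": 0.5, "w2": 0.3, "w3": 0.2}                       # P(wi)
paths = {"w1": ["l1"], "w2": ["l1", "l2"], "w3": ["l1", "l2"]}   # S(wi)
pc_node = {                                                      # Pc(wi/lk)
    ("w1", "l1"): 0.90, ("w2", "l1"): 0.95, ("w3", "l1"): 0.90,
    ("w2", "l2"): 0.80, ("w3", "l2"): 0.85,
}

def class_rate(w):
    """Pc(wi): product of the node rates along the path S(wi)."""
    return math.prod(pc_node[(w, node)] for node in paths[w])

def tree_rate():
    """Pc(T) of equation (3.1)."""
    return sum(priors[w] * class_rate(w) for w in priors)

def node_rate(node):
    """Pc(lk) of equation (3.2), averaged over the classes below the node."""
    below = [w for w in priors if node in paths[w]]
    weighted = sum(priors[w] * pc_node[(w, node)] for w in below)
    return weighted / sum(priors[w] for w in below)

print(tree_rate(), node_rate("l1"), node_rate("l2"))
```

In this toy tree the node averages (about 0.915 and 0.82) are both higher than the tree rate (about 0.831), which illustrates why node-level performance alone does not determine Pc(T).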

3.2.2.  Other  Measures  Of  Tree  Performance 

While the probability of error is a good measure of tree
performance, it is often difficult to compute for arbitrary
feature distributions. For two-class cases, several bounds on
the Bayes error have been suggested [8]. In this section, we
consider the use of such measures (bounds) to bound the
performance of a decision tree. Let d(wi,wj;F) be some measure
of separability of classes wi, wj using features F. If wi, wj
are among a set of M classes distinguished by a tree, T, we
wish to define the separability of the classes as a function of
the tree. Let lk denote the node at which the two classes get
separated, and let $F_k \subseteq F$ denote the set of features
used at that node. If a binary tree is assumed, one could
define the separation of wi, wj by T in any of the following
ways.

    $D(w_i,w_j;T) = d(w_i,w_j;F_k)$   (3.3a)

    $D(w_i,w_j;T) = \min_{w_s, w_t} d(w_s,w_t;F_k)$   (3.3b)

where $w_t \in W_0(l_k)$, $w_s \in W_1(l_k)$.

    $D(w_i,w_j;T) = \sum_{s,t} c_{st} \, d(w_s,w_t;F_k)$   (3.3c)

where $c_{st}$ are weights and $w_s \in W_0(l_k)$,
$w_t \in W_1(l_k)$. W0(lk), W1(lk) are the sets of classes
distinguished at lk.

The total tree performance, Pc(T), could be defined as,

    $P_c(T) = \sum_i \sum_j D(w_i,w_j;T)$   (3.4)


3.3.  Error  in  Assuming  Sum  of  Products  Form  of  Pc(T) 

It has been assumed in the previous discussions that the total
performance of a tree, Pc(T), can be written as a sum of
products, as,

    $P_c(T) = \sum_i P(w_i) \prod_k P_c(w_i/l_k)$   (3.5)

Equation (3.5) is exact when the features used at different
nodes are class-conditionally statistically independent. Thus,
statistical independence is a sufficient though not necessary
condition for equation (3.5) to be exactly equal to the true
tree performance. In this section, a more general set of
conditions is derived for (3.5) to be exact. Also, in those
cases when these conditions do not hold, an upper bound on the
error (in assuming the product form) is derived in terms of two
bounds:

• an upper bound on the (true) error rate of any class by the
  tree,

• an upper bound on the error rate of any rule used at a node
  for a sample from some class, wi.

Equation (3.5) is not accurate in general because the quantity
Pc(wi/lk) is the performance on a sample from wi distributed
according to the distribution, say, p(X/wi), in nature.
However, when this rule is used at the k-th node in a tree, the
distribution of the class wi samples that arrive at lk is NOT
p(X/wi), in general, but some other distribution, say,
p'(X/wi). This is due to the samples that are directed away
from the path to wi by the rules used above node lk in the
tree. Let Pc(T) be the tree performance, Pc(wi/T) the correct
recognition rate for class wi samples using the tree, T, and
$\hat{P}_c(T)$ the product sum on the right-hand side of
Eqn. (3.5). Then one seeks a bound on,

    $|\, P_c(T) - \hat{P}_c(T) \,|$   (3.6)

where, $P_c(T) = \sum_i P(w_i) \, P_c(w_i/T)$.

The bound (3.6) may be derived by first bounding the quantity,

    $|\, P_c(w_i/T) - \prod_{l_k \in S(w_i)} P_c(w_i/l_k) \,|$   (3.7)

For convenience, let l1, l2, .., lni be the nodes on the path
to wi in the tree, as shown below.

The total population of wi can be divided into two disjoint
populations, with respect to node lk: those that trickle down
to node lk and those that do not. Define,

pci(lk) = correct recognition rate of the rule at lk on that
          population of wi samples that reach lk in the tree.

pri(lk) = correct recognition rate of the rule at lk if it
          were used on the samples that do NOT reach lk.

Pi(k-1) = fraction of the total population of class wi that
          reaches lk via l1, l2, .., l(k-1).

Then, since the two populations defined earlier are disjoint,

    $P_c(w_i/l_k) = p_{ci}(l_k) \, P_i(k-1) + p_{ri}(l_k) \, [\, 1 - P_i(k-1) \,]$   (3.8)

Also, the Pi(k-1) are related as follows:

    $P_i(1) = P_c(w_i/l_1)$
    $P_i(2) = p_{ci}(l_2) \, P_i(1)$
    $P_i(3) = p_{ci}(l_3) \, P_i(2)$, etc.

Hence by back substitution, one obtains,

    $P_i(n_i) = P_c(w_i/T) = \prod_{k=1}^{n_i} p_{ci}(l_k)$   (3.9)

where $p_{ci}(l_1) = P_c(w_i/l_1)$.


Substituting for pci(lk) from equation (3.8) into (3.9),

    $P_c(w_i/T) = \prod_{k=1}^{n_i} P_c(w_i/l_k) \, \frac{1 - p_{ri}(l_k)\,(1 - P_i(k-1)) / P_c(w_i/l_k)}{P_i(k-1)}$

We can then define the ratio of the true and estimated values
of Pc(wi/T) as,

    $R_i = \frac{P_c(w_i/T)}{\prod_{k=1}^{n_i} P_c(w_i/l_k)} = \prod_{k=1}^{n_i} \frac{1 - p_{ri}(l_k)\,(1 - P_i(k-1)) / P_c(w_i/l_k)}{P_i(k-1)}$   (3.10)
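Equations (3.8) to (3.10) can be checked against each other numerically. The sketch below assumes Pi(0) = 1 (every sample reaches the root) and uses hypothetical values of pci(lk) and pri(lk) along a three-node path; the function name `path_quantities` is illustrative.

```python
import math

def path_quantities(pci, pri):
    """Return (true Pc(wi/T), product-form estimate, Ri from (3.10)) for one path.

    pci[k], pri[k] are the rule's correct rates on the wi samples that do / do not
    reach node l(k+1). Pi(0) = 1 is assumed, i.e. all samples reach the root.
    """
    reach = 1.0                        # Pi(k-1), starting from Pi(0)
    node_rates, factors = [], []
    for c, r in zip(pci, pri):
        rate = c * reach + r * (1.0 - reach)                        # Pc(wi/lk), equation (3.8)
        factors.append((1.0 - r * (1.0 - reach) / rate) / reach)    # one factor of (3.10)
        node_rates.append(rate)
        reach *= c                                                  # Pi(k) = pci(lk).Pi(k-1)
    true_rate = reach                                               # equation (3.9)
    estimate = math.prod(node_rates)                                # product form
    return true_rate, estimate, math.prod(factors)

# Hypothetical rates along a three-node path.
true_rate, estimate, ratio = path_quantities([0.9, 0.8, 0.85], [0.5, 0.6, 0.4])
print(true_rate, estimate, ratio, true_rate / estimate)
```

Here the rule performs worse on the diverted samples than on those that arrive (pri < pci at every node), so the product form underestimates Pc(wi/T) and the ratio exceeds 1. Setting pri equal to pci makes every factor equal to 1.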


From equation (3.10) it follows that Ri = 1, i.e. the product
form for Pc(wi/T) will be exact, if,

    $p_{ri}(l_k) = P_c(w_i/l_k) = p_{ci}(l_k)$ for $k \in \{1,2,3,..,n_i\}$   (3.11)


The above result states that if the average performance of the
rule at lk on the samples rejected prior to reaching lk is the
same as that on the samples reaching lk, then the product form
will be exact. Moreover, the extreme values of Ri occur for
pri=1 and pri=0. If pri=0 is true, then the product form will
be pessimistic, i.e. give too low a value of Pc(wi/T), while if
pri=1, it will be optimistic, i.e. will result in too large a
value of the estimated Pc(wi/T). Thus Ri(max) is greater than 1
and Ri(min) is less than 1.


n; 

Pc(wi/T)  /Ri(max)  < "f^Pc(wi/lk)  < Pc(wi/T)  /Ri(min) 
Hence, 

Pc(wi/T)  - Pc(wi/lk)  < Pc(wi/T){  1/Ri(min)  - 1/Ri(max)> 


=Pc(wi/T) 


LUl 


1 - (.1  - Pi(U-l) ) 
Pc(  wi/lk)  ' 


-IJ.. 


( k-1 ) 


Cs-ii) 


80 


To simplify the above expression, we shall put bounds on
the error rate for each class by the tree, and by any
particular node. Let,

$$1 - P_c(w_i/l_k) < e_l(i), \qquad 1 - P_c(w_i/T) < e(i)$$

Then, 1 - Pi(k-1) < e(i), and hence, equation (3.12) can
be written as,

$$\Bigl|\, P_c(w_i/T) - \prod_{k=1}^{n_i} P_c(w_i/l_k) \Bigr| < P_c(w_i/T)\,\Bigl[ \bigl( 1 - e(i)/(1 - e_l(i)) \bigr)^{-n_i} - 1 \Bigr] \approx P_c(w_i/T)\; n_i\; e(i)/(1 - e_l(i)) \qquad (3.13)$$

if e(i)/(1-el(i)) is small in comparison to 1.
Substituting in equation (3.6) using (3.13), one gets,

$$\bigl| \hat{P}_c(T) - P_c(T) \bigr| < \sum_i P(w_i)\,P_c(w_i/T)\; n_i\; e(i)/(1 - e_l(i))$$

where \hat{P}_c(T) denotes the product-form estimate of Pc(T).

Consider the special case of a balanced tree with n
levels in which the upper bounds, e(i) and el(i), are the
same for all classes. Then the percentage error in
estimating Pc(T) as the product sum on the right hand side of
equation (3.5), is given by,

$$\frac{\bigl| \hat{P}_c(T) - P_c(T) \bigr|}{P_c(T)} < \frac{n\,E_t}{1 - E_l} \qquad (3.15)$$


In (3.15), Et is the upper bound on the error rate of
any class by the total tree, El is the upper bound on the
error rate for any class by any SINGLE node's decision rule,
and n is the number of levels in the tree. Equ. (3.15) shows
that the error in the estimate increases linearly with the
levels in the tree and the class error rate.
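
The relations (3.8) to (3.11) can be checked numerically. The Python sketch below is only an illustration: the function name product_form_ratio and the sample rates are assumptions introduced here, not part of the original analysis. It follows one class down its path, forms the node rates of (3.8), and returns the ratio Ri of (3.10).

```python
def product_form_ratio(pc, pr):
    """Ratio R_i of the true class rate to its product-form estimate.

    pc[k]: rate of the k-th node's rule on the class samples that reach it.
    pr[k]: rate the same rule would have on the samples that do NOT reach it.
    """
    reach = 1.0       # P_i(k-1): fraction of the class reaching the current node
    true_rate = 1.0   # running product of pc, eq. (3.9)
    estimate = 1.0    # running product of the node rates Pc(wi/lk)
    for pc_k, pr_k in zip(pc, pr):
        node_rate = pc_k * reach + pr_k * (1.0 - reach)   # eq. (3.8)
        estimate *= node_rate
        true_rate *= pc_k
        reach *= pc_k                                      # P_i(k) = pc * P_i(k-1)
    return true_rate / estimate


rates = [0.9, 0.8, 0.9]                              # made-up per-node rates
exact = product_form_ratio(rates, rates)             # pri = pci at every node
pessimistic = product_form_ratio(rates, [0.0] * 3)   # pri = 0
optimistic = product_form_ratio(rates, [1.0] * 3)    # pri = 1
```

With pri equal to pci the ratio is 1, while pri = 0 and pri = 1 push it above and below 1 respectively, in line with the discussion of Ri(max) and Ri(min).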


3.4.  A Property  Of  The  Optimal  Decision  Policy 

In a general decision scheme, the decision at a node,
lk, regarding the next node to be traversed, is a function
of all the measurements taken on the path to lk from the
root. Consider as a special case, those decision policies
which depend only on the last feature measured (abbreviated
as OS, for one-step policies). Equ. (3.2) provides a means
of choosing a policy which maximizes the average correct
recognition rate at a node, Pc(lk). This optimal rule is
called a 'one-step maximum-likelihood' rule (abbreviated
OSMLR), and given by,


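OSMLR: If the measured feature value at lk is xlk=s, then
if,

$$\sum_{w_i \in W_L} P(w_i)\,p(x_{lk}=s/w_i) > \sum_{w_j \in W_R} P(w_j)\,p(x_{lk}=s/w_j)$$

then go left,
else go right,

where W_L and W_R denote the sets of classes under the left
and right descendants of lk.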


In the above formulation, a binary decision is assumed
at lk, but a similar maximum likelihood rule may be written
for the case when lk has m (>2) descendant nodes. The
property described below shows that using an OSMLR rule at
each node does not necessarily result in the optimal tree
performance, Pc(T).


Property 1:

Using a one-step maximum likelihood rule (OSMLR) at each
node of a tree, T, does NOT necessarily result in the optimal
tree performance, Pc(T), across all possible choices of
one-step (OS) rules. This property holds even when the
features are statistically independent.

To see why this property holds, consider the binary tree
of Fig. 3.2, in which feature fi is assumed to be measured
at node li.

If an OSMLR rule is used at each node, the total tree
performance, and correct recognition rates at the nodes,
are,

$$P_c(T) = P(w_1)\,P_c(w_1/l_1)\,P_c(w_1/l_2) + P(w_2)\,P_c(w_2/l_1)\,P_c(w_2/l_2) + P(w_3)\,P_c(w_3/l_1)\,P_c(w_3/l_3) + P(w_4)\,P_c(w_4/l_1)\,P_c(w_4/l_3)$$

$$P_c(l_1) = P(w_1)\,P_c(w_1/l_1) + P(w_2)\,P_c(w_2/l_1) + P(w_3)\,P_c(w_3/l_1) + P(w_4)\,P_c(w_4/l_1)$$

$$P_c(l_2) = \frac{P(w_1)\,P_c(w_1/l_2) + P(w_2)\,P_c(w_2/l_2)}{P(w_1) + P(w_2)}$$

$$P_c(l_3) = \frac{P(w_3)\,P_c(w_3/l_3) + P(w_4)\,P_c(w_4/l_3)}{P(w_3) + P(w_4)}$$

Let x2=s be some state of feature f2, such that,

$$P(w_1)\,p(x_2=s/w_1) > P(w_2)\,p(x_2=s/w_2)$$

Then the OSMLR would classify a sample with x2=s as
belonging to class w1. Now, suppose we choose the OPPOSITE
of the maximum-likelihood rule and assign such a sample to
w2. Then the change in Pc(l2), the recognition rate at node
2, is,

$$d(P_c(l_2)) = \frac{P(w_2)\,p(x_2=s/w_2) - P(w_1)\,p(x_2=s/w_1)}{P(w_1) + P(w_2)} < 0$$

Hence the node performance decreases, as expected. The
change in the total tree's performance is,

$$d(P_c(T)) = P(w_2)\,P_c(w_2/l_1)\,p(x_2=s/w_2) - P(w_1)\,P_c(w_1/l_1)\,p(x_2=s/w_1)$$

The tree performance would IMPROVE if,

$$\frac{p(x_2=s/w_2)}{p(x_2=s/w_1)} > \frac{P(w_1)\,P_c(w_1/l_1)}{P(w_2)\,P_c(w_2/l_1)}$$

Hence,  using  a one-step  maximum  likelihood  rule  (OSMLR) 
at  each  node  does  not  necessarily  minimize  the  error  rate  of 
the  total  tree.  The  numerical  example  given  below  bears  out 
this  point. 

Example : 

Assume the decision tree of fig. 3.2 where f1 has four
states, {a,b,c,d}, f2 can take on any of two states, {x,y},
and f3, {u,v}. The class conditional state probabilities of
the features and a priori class probabilities are tabulated
below.


Using an OSMLR at each node results in the following
rules.

f1 = a (go right)     f2 = x (go right)     f3 = u (go left)
   = b (left)            = y (left)            = v (left)
   = c (left)
   = d (left)

The correct recognition rates of the tree and nodes
are,

Pc(T)=0.372   Pc(l1)=.63   Pc(l2)=.6   Pc(l3)=.8

If we violate the maximum likelihood rule at node 1 and
go right for f1 = b, the new performance figures are,

Pc(T)=.384   Pc(l1)=.58   Pc(l2)=.6   Pc(l3)=.8

Thus, the performance at node 1 has DECREASED, but the
total tree performance INCREASED when the OSMLR was
violated.
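
Property 1 can also be reproduced by brute force on a small tree. The Python sketch below is an illustration under stated assumptions: the priors and state probabilities are invented for the purpose (they are not the values of the example above), the features are taken to be class-conditionally independent, and names such as tree_rate and osmlr are hypothetical. It evaluates Pc(T) for the OSMLR at every node and for every possible combination of one-step rules.

```python
from itertools import product

# Hypothetical numbers chosen for illustration only.
PRIOR = {"w1": 0.30, "w2": 0.20, "w3": 0.25, "w4": 0.25}

# Each node: the two class groups it separates and p(state | class).
NODES = {
    "l1": {"left": ("w1", "w2"), "right": ("w3", "w4"),
           "lik": {"a": {"w1": 0.5, "w2": 0.9, "w3": 0.1, "w4": 0.1},
                   "b": {"w1": 0.5, "w2": 0.1, "w3": 0.9, "w4": 0.9}}},
    "l2": {"left": ("w1",), "right": ("w2",),
           "lik": {"x": {"w1": 0.5, "w2": 0.6},
                   "y": {"w1": 0.5, "w2": 0.4}}},
    "l3": {"left": ("w3",), "right": ("w4",),
           "lik": {"u": {"w3": 0.8, "w4": 0.3},
                   "v": {"w3": 0.2, "w4": 0.7}}},
}
PATH = {"w1": ("l1", "l2"), "w2": ("l1", "l2"),
        "w3": ("l1", "l3"), "w4": ("l1", "l3")}


def class_rate(node, rule, w):
    """Pc(w/node): probability that `rule` sends class w to its own side."""
    side = "left" if w in NODES[node]["left"] else "right"
    return sum(p[w] for s, p in NODES[node]["lik"].items() if rule[s] == side)


def tree_rate(rules):
    """Pc(T): prior-weighted product of node rates along each class path."""
    total = 0.0
    for w, prior in PRIOR.items():
        rate = prior
        for node in PATH[w]:
            rate *= class_rate(node, rules[node], w)
        total += rate
    return total


def osmlr(node):
    """One-step maximum likelihood rule: send each state to the heavier side."""
    info = NODES[node]
    rule = {}
    for s, p in info["lik"].items():
        left = sum(PRIOR[w] * p[w] for w in info["left"])
        right = sum(PRIOR[w] * p[w] for w in info["right"])
        rule[s] = "left" if left > right else "right"
    return rule


def all_rules(node):
    """Every assignment of the node's feature states to left/right."""
    states = list(NODES[node]["lik"])
    for sides in product(("left", "right"), repeat=len(states)):
        yield dict(zip(states, sides))


osmlr_rate = tree_rate({n: osmlr(n) for n in NODES})
best_rate = max(tree_rate(dict(zip(NODES, combo)))
                for combo in product(*(list(all_rules(n)) for n in NODES)))
```

With these made-up numbers the OSMLR tree gives Pc(T) = 0.4875, while the exhaustive search reaches 0.5205 by sending state x at node l2 to w2 against the maximum likelihood rule.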

3.4.1.  A Bound  On  The  Tree  Performance 

It was shown in the previous discussion that using a
maximum likelihood rule at each node of a tree does not
necessarily optimize the performance of the total tree, even
when the node decisions are assumed to be statistically
independent. It is clear, however, that the choice of the
node features places an upper bound on the optimal tree
performance. In this section, the problem of computing this
bound is solved as an optimization problem under linear
constraints. The problem, so posed, is amenable to a host
of solution techniques [32]. A method of using dynamic
programming to obtain the upper bound is described here,
using the example from the previous section for
illustration.

Under the assumption of statistically independent node
decisions, the tree performance is given by,

$$P_c(T) = \sum_i P(w_i) \prod_k P_c(w_i/l_k) \qquad (3.16)$$

The performance of the rule used at lk, averaged across
all classes under lk is given by,

$$P_c(l_k) = \sum_i P(w_i)\,P_c(w_i/l_k) \Big/ \sum_i P(w_i) \qquad (3.17)$$

For a given choice of features used at lk, Pc(lk) is
maximized if a maximum likelihood rule is used. Let bk
denote the maximum value of the numerator of Eqn. (3.17),
i.e.

$$b_k = \max \Bigl\{ \sum_i P(w_i)\,P_c(w_i/l_k) \Bigr\}$$


For ease of notation, let,

Pi = P(wi),  yik = Pc(wi/lk).

Then, the upper bound on Pc(T) is obtained as the
solution of the constrained optimization problem,

$$\max_{y_{ik}} \; \sum_i P_i \prod_k y_{ik} \qquad (3.18)$$

subject to

$$\sum_i P_i\, y_{ik} \le b_k, \quad k = 1, 2, \ldots, N,$$

$$0 \le y_{ik} \le 1, \quad k = 1, 2, \ldots, N,$$

where M, N are the number of classes (terminal nodes)
and decision nodes in the tree, respectively.

To solve (3.18), define the partial sum, Gn, n=1,2,...M,
as

$$G_n = \sum_{i=1}^{n} P_i \prod_k y_{ik}, \quad 1 \le n \le M.$$

The  problem  can  be  solved  by  regarding  the  bk  as  the 
total  resources  available  of  each  kind,  and  Gn  as  the  return 
from  the  first  n activities.  Optimizing  Gn,  and  then 
recursively  defining  Gn  in  terms  of  Gn-1  , leads  to  the 
dynamic  programming  formulation. 


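Example:

Consider the tree below, consisting of three nodes,
l1, l2, l3.

[Figure: three-node tree with nodes l1, l2, l3]

The problem is to maximize

$$P_1\,y_{11}\,y_{12} + P_2\,y_{21}\,y_{22} + P_3\,y_{31}\,y_{33} + P_4\,y_{41}\,y_{43}$$

subject to,

$$P_1\,y_{11} + P_2\,y_{21} + P_3\,y_{31} + P_4\,y_{41} \le b_1$$

$$P_1\,y_{12} + P_2\,y_{22} \le b_2$$

$$P_3\,y_{33} + P_4\,y_{43} \le b_3$$

$$0 \le y_{ik} \le 1.$$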

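Define the following sequence of optimization problems:

$$G_1(z_1, z_2) = \max\; P_1\,y_{11}\,y_{12} \quad \text{for } 0 \le z_1 \le b_1,\; 0 \le z_2 \le b_2,$$

$$\text{where } 0 \le y_{11} \le \min\{1,\, z_1/P_1\}, \quad 0 \le y_{12} \le \min\{1,\, z_2/P_1\}.$$

$$G_2(z_1, b_2) = \max \{\, P_2\,y_{21}\,y_{22} + G_1(z_1 - P_2\,y_{21},\; b_2 - P_2\,y_{22}) \,\} \quad \text{for } 0 \le z_1 \le b_1,$$

$$\text{where } 0 \le y_{21} \le \min\{1,\, z_1/P_2\}, \quad 0 \le y_{22} \le \min\{1,\, b_2/P_2\}.$$

$$G_3(z_1, b_2, z_3) = \max \{\, P_3\,y_{31}\,y_{33} + G_2(z_1 - P_3\,y_{31},\; b_2) \,\} \quad \text{for } 0 \le z_1 \le b_1,\; 0 \le z_3 \le b_3,$$

$$\text{where } 0 \le y_{31} \le \min\{1,\, z_1/P_3\}, \quad 0 \le y_{33} \le \min\{1,\, z_3/P_3\}.$$

$$\max P_c(T) = G_4(b_1, b_2, b_3) = \max \{\, P_4\,y_{41}\,y_{43} + G_3(b_1 - P_4\,y_{41},\; b_2,\; b_3 - P_4\,y_{43}) \,\}$$

$$\text{where } 0 \le y_{41} \le \min\{1,\, b_1/P_4\}, \quad 0 \le y_{43} \le \min\{1,\, b_3/P_4\}.$$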


The problem was solved by discretizing the ranges of the
variables, zj and yik, into 10 equal intervals. The results
are tabulated for different a priori probabilities, and
resource levels, bk. The relationship between the optimal
tree performance and a given node's performance when the
other node performances are kept fixed, is depicted by the
plots in Graph 1.
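
A discretized version of this resource-allocation recursion can be sketched in a few lines of Python. The code below is an illustration under assumptions rather than the program used for the tabulated results: the priors, the function name tree_bound, and the memoized top-down form of the recursion are choices made here. It treats the four-class, three-node example, restricts every yik to the grid 0.0, 0.1, ..., 1.0, and carries the unused part of each bk forward as the state.

```python
from functools import lru_cache

P = (0.25, 0.25, 0.25, 0.25)          # hypothetical priors P1..P4
CHILD = (0, 0, 1, 1)                  # classes 1,2 pass through l2; classes 3,4 through l3
GRID = [j / 10 for j in range(11)]    # admissible y values: 0.0, 0.1, ..., 1.0
EPS = 1e-9                            # tolerance for floating-point comparisons


def tree_bound(b1, b2, b3):
    """Grid approximation to the maximum of sum_i P_i * y_i1 * y_i,child."""

    @lru_cache(maxsize=None)
    def best(i, r1, r2, r3):
        # Best return from classes i, i+1, ... given the remaining resources.
        if i == len(P):
            return 0.0
        child_left = (r2, r3)[CHILD[i]]
        top = 0.0
        for y_root in GRID:
            if P[i] * y_root > r1 + EPS:          # root resource b1 exhausted
                break
            for y_child in GRID:
                if P[i] * y_child > child_left + EPS:   # child resource exhausted
                    break
                remaining = [r2, r3]
                remaining[CHILD[i]] = round(child_left - P[i] * y_child, 9)
                gain = P[i] * y_root * y_child + best(
                    i + 1, round(r1 - P[i] * y_root, 9),
                    remaining[0], remaining[1])
                top = max(top, gain)
        return top

    return best(0, b1, b2, b3)


loose = tree_bound(1.0, 0.5, 0.5)     # every node rule could be perfect
tight = tree_bound(0.8, 0.4, 0.4)     # limited node performance
```

When the resource levels allow every yik to equal 1 the bound is 1; with tighter levels the bound can be no larger than b1, since each term Pi.yi1.yik is at most Pi.yi1. Because of the grid restriction the value is an approximation from below to the continuous optimum.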




3.5. A Property Of The Optimal Feature Assignment

In the previous section, it was shown that optimizing
separately the average correct recognition rate at each
node of a decision tree does not necessarily optimize the
total tree performance. A similar type of 'negative'
property can be shown for the optimal feature set.

Assume that we use only a maximum likelihood rule at
each tree node. One could evaluate each feature's 'goodness'
at each node as the value of Pc(lk) when that feature is
used at node lk. Then the assignment could be done by
assigning to each node that feature which had the best
value of the goodness measure at that node. Property 2, which
we show by example, highlights the fact that such an
assignment is not necessarily optimal.

Property 2:

The tree resulting from using at each node that feature
which performs the best at that node using a maximum
likelihood rule is not necessarily the optimal assignment,
even if the features are statistically independent.




Fig.  3.4 

Consider the 2 level binary decision tree shown in
fig. 3.4. The features to be used at the nodes have to be
chosen from a set of 6 features, f1-f6, each taking on one
of 4 states. The probability of occurrence of each state,
for each class, is given in the table below:


.25  .45  .25  .05 
.15  .55  .25  .05 
.20  .30  .20  .30 
.35  .45  .10  .10 
.20  .25  .15  .40 
.55  .15  .05  .25 


The a priori class probabilities are .15, .1, .50 and
.25, respectively. If a maximum likelihood rule is used at
each node to partition the set of classes into the two
groups distinguished at that node, the correct recognition
rates, using each of the 6 features, are tabulated below:

           F1     F2     F3     F4     F5     F6
         ==========================================
Node 1    .76    .75    .75    .75    .75    .75
Node 2    .60    .74    .71    .66    .64    .62
Node 3    .73    .67    .67    .82    .67    .72

From the above table, it would appear that the
assignment of feature 1 to node 1, feature 2 to node 2, and
feature 4 to node 3, would result in the least probability
of error for the total tree, assuming a maximum likelihood
rule is used at every node. However, as the figures below
prove, there are other features, which together do better
than the set {f1,f2,f4} used at nodes {1,2,3},
respectively.


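Feature used at
Node 1    Node 2    Node 3    Recognition Rate
================================================
[feature entries not recoverable from the source; the
recognition rates read .611 for the first row and .613
for each of the remaining four rows]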

Thus, even using suboptimal features, 2 and 5 at nodes 1
and 2, we get a better error rate than using the
individually best features at these nodes, viz., 1 and 2.

In this chapter, it has been shown that even under the
restriction that the node decisions are statistically
independent, the optimal tree design problem is complex. The
next two chapters discuss methods of decomposing the design
problem into phases, and the use of optimization methods in
solving each phase of the task.


4. A Phased Approach To Optimal Tree Design

This chapter proposes a decomposition of the
hierarchical classifier design problem into several phases,
and investigates the use of dynamic programming procedures
for obtaining the optimal solution for each phase.

The  decomposition  of  the  design  allows  the  designer  to 
input  into  the  design  procedure  any  a priori  knowledge 
regarding  the  problem  that  is  available.  This  knowledge 
might  be  related  to  the  tree  structure  or  the  features 
deemed to be 'good' for distinguishing certain classes. The
a priori information regarding the tree can be categorized
into three levels:

(a) No assumptions regarding the form of the tree.

(b) The tree 'skeleton' is assumed given. By 'skeleton',
we mean the form of the tree together with terminal node
labels. However, the features to be used at each node are not
specified.

(c) Both the tree skeleton and the features to be used at
the nodes are specified. The decision rules at each node
have to be designed.

The type of assumptions, (a), (b), or (c), made in any
particular application depends upon the designer's knowledge
about the data: its modality, separability, and the goodness
of particular features in distinguishing some subsets of the
data. Thus, in a general situation, when one's problem
knowledge is minimal, only assumption (a) may be justified.
If through data analysis methods, such as hierarchical
clustering, or a priori knowledge, a suitable tree skeleton is
considered adequate, assumption (b) may be used. If, in
addition, the designer knows that certain features are good
for discriminating between some two or more subsets of the
classes, one also knows the features to be used at some or
all of the tree nodes. The design problem is then that of
case (c), where the decision rule at each node must be
specified for overall optimal tree performance.

The  following  analysis  shows  that  finding  the  optimal 
tree  by  exhaustive  enumeration  is  an  impractical  approach 
for  most  problems. 

4.0 Computational  Complexity 

The tree design problem is a 'three-dimensional' search
involving

(i)  specifying a tree skeleton,

(ii)  assigning features to nodes, and,

(iii)  specifying decision rules at each node.

Consider the number of trees to be evaluated if an
exhaustive search is done across all possibilities in (i)-
(iii). If Nt, Nf, Nd are, respectively, the possible trees,
assignments of features to each tree, and decision rules for
each assignment, then the total possibilities are,

N = Nt.Nf.Nd

As an example, consider a C class problem, using f
features, each feature taking on one of S states, and with
the restriction that only balanced binary trees, with a
single path to each class, and using any feature only at one
node, are being considered. Then,

[expression for N not recoverable from the source]

The possibilities become enormous even for 'small'
values of f, C, and S. The computational complexity of the
problem is considerably reduced by using the optimality
principle of dynamic programming.
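
To give a feel for how quickly N = Nt.Nf.Nd grows, the Python sketch below counts the possibilities under one particular set of counting conventions. These conventions are assumptions made here and may differ from those of the original expression: C is a power of two, skeletons that differ only by swapping left and right sons are counted once, the C-1 decision nodes receive distinct features in order, and every state of a feature may be sent left or right independently. The function names are hypothetical.

```python
from math import comb, perm


def balanced_skeletons(c):
    """Balanced binary skeletons on c labelled classes (c a power of two)."""
    if c == 1:
        return 1
    half = c // 2
    # Choose the classes of one half; divide by 2 because swapping the
    # left and right subtrees gives the same skeleton.
    return comb(c, half) // 2 * balanced_skeletons(half) ** 2


def exhaustive_count(c, f, s):
    """N = Nt * Nf * Nd under the counting conventions stated above."""
    nodes = c - 1                      # decision nodes of a binary tree with c leaves
    n_t = balanced_skeletons(c)
    n_f = perm(f, nodes)               # ordered choice of distinct features
    n_d = (2 ** s) ** nodes            # left/right choice per state, per node
    return n_t * n_f * n_d


small_case = exhaustive_count(4, 6, 4)   # 4 classes, 6 features, 4 states
```

Even this small case (C=4, f=6, S=4) gives 3 x 120 x 4096 = 1,474,560 candidate trees under these conventions.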

4.1.  Tree  Design  Using  Dynamic  Programming 

In this section, we examine a set of methods that have
proved useful in operations research and are popularly known
as dynamic programming [2].

The criterion of optimality is assumed to be a weighted
sum of the measurement cost and the risk of making an
incorrect classification. Depending upon the information
available to the designer regarding the tree, the design
problems may be posed in the following way, in increasing
order of complexity.

(1)  Given the tree structure (i.e. the class hierarchy and
the features to be used at the nodes), design an optimal
policy.

(2)  Given only the tree skeleton (i.e. the class
hierarchy), determine the optimal feature measurement and
decision policy.

(3)  Given no information regarding the form of the tree,
design an optimal tree structure.

The algorithms proposed in subsequent sections for
solving (1) and (2) give the optimal solution for any
assumed feature distribution. The bottom-up procedure for
problem (3), however, leads to the optimal tree structure
only under an 'additive cost' assumption described later.


4.1.1.  Optimal  Decision  Policy  Given  The  Tree 


Consider the case when the tree structure of the
classifier is given and one seeks the optimal decision
policy at each of the nodes. One would also like to have the
facility of terminating the measurement process at any time
and 'accepting' one of the class labels which has NOT been
rejected till that point. In a decision tree, some classes
are rejected at every node. Thus, the design procedure must
compute at each node the optimal action to be taken, and
this can be one of the following:

. classify the sample into some class not rejected
so far or,

. continue the measurement process and decide the next
node to be traversed under the current node.

The following quantities will be defined to illustrate
the dynamic programming (abbreviated DP) procedure, using
the example of fig. 4.1 above.

ln = node label at which feature fn is used,
fnl, fnr are the features used at the nodes to the
left/right below ln (assuming a binary decision node),
c(fn) = cost of measuring feature fn,

r(x1,...,xn) = minimum risk of decision process, given that
x1,x2,...,xn have been observed so far in the tree.

R(wi/x1,x2,...,xn) = loss incurred in classifying sample into
wi given x1,x2,...,xn.

p(wi/x1,x2,x3,...,xn) = conditional prob. of wi given
x1,x2,...,xn.

For the sake of notational brevity, assume that the
string l1,l2,...,ln denotes the nodes on the path to ln from
the root. If lnl, lnr denote the sons of ln and the
corresponding feature observations are xnl, xnr, then the
minimum risk r(x1,x2,...,xn) may be computed recursively
as,

$$r(x_1, \ldots, x_n) = \min \begin{cases} \text{(i)} & \min_{w_k} R(w_k/x_1, \ldots, x_n) \\ \text{(ii)} & c(f_{nl}) + E\{\, r(x_1, \ldots, x_n, x_{nl}) \,\} \\ \text{(iii)} & c(f_{nr}) + E\{\, r(x_1, \ldots, x_n, x_{nr}) \,\} \end{cases} \qquad (4.1)$$

where the E{} terms are expectations of the risk, r(),
over all possible values of xnl, or xnr, respectively.

If ln is a terminal node at which the decision is to
classify the sample in class wk, then,

$$r(x_1, \ldots, x_n) = R(w_k/x_1, \ldots, x_n)$$

Working in a bottom-up fashion from the terminals, we
can determine the optimum policy and risk at each node. In
the recursive equation (4.1), if the quantity (i) is
minimum, then it implies that it is best to classify the
sample and discontinue further measurement, while if (ii) or
(iii) is the smallest, it implies that the best action is
to traverse the left or right node, respectively.
Correspondingly, one set or the other set of classes is
rejected from further consideration.
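
The recursion for the minimum risk can be sketched in Python for a small tree. The code below is an illustration under assumptions, not the procedure as originally programmed: a 0-1 loss is used, the features are taken to be class-conditionally independent so that posteriors can be updated one feature at a time, and the tree, likelihood values and function names are all invented here. The function expected_cost adds the measurement cost of a node's feature to the expectation, over that feature's states, of the smallest of the three options: classify now, go to the left son, or go to the right son.

```python
PRIORS = [0.25, 0.25, 0.25, 0.25]
# LIK[feature][state][class] = p(state | class); all values are invented.
LIK = {
    "f1": [[0.8, 0.7, 0.2, 0.1], [0.2, 0.3, 0.8, 0.9]],
    "f2": [[0.9, 0.2, 0.5, 0.5], [0.1, 0.8, 0.5, 0.5]],
    "f3": [[0.5, 0.5, 0.7, 0.1], [0.5, 0.5, 0.3, 0.9]],
}
# A node is ("leaf", class_index) or ("node", feature, left_son, right_son).
L2 = ("node", "f2", ("leaf", 0), ("leaf", 1))
L3 = ("node", "f3", ("leaf", 2), ("leaf", 3))
ROOT = ("node", "f1", L2, L3)


def classes_under(node):
    """Classes not yet rejected when the process is at `node`."""
    if node[0] == "leaf":
        return [node[1]]
    return classes_under(node[2]) + classes_under(node[3])


def observe(post, feature, state):
    """Return (posterior after seeing `state`, probability of `state`)."""
    joint = [p * l for p, l in zip(post, LIK[feature][state])]
    total = sum(joint)
    return [j / total for j in joint], total


def expected_cost(node, post, cost):
    """Expected cost of entering `node` when the class posterior is `post`."""
    if node[0] == "leaf":
        return 1.0 - post[node[1]]                 # 0-1 loss of accepting the label
    _, feature, left, right = node
    total = cost[feature]                          # pay for the node's measurement
    for state in range(len(LIK[feature])):
        new_post, p_state = observe(post, feature, state)
        stop = min(1.0 - new_post[w] for w in classes_under(node))   # option (i)
        go = min(expected_cost(left, new_post, cost),                # option (ii)
                 expected_cost(right, new_post, cost))               # option (iii)
        total += p_state * min(stop, go)
    return total


free = expected_cost(ROOT, PRIORS, {"f1": 0.0, "f2": 0.0, "f3": 0.0})
paid = expected_cost(ROOT, PRIORS, {"f1": 0.0, "f2": 0.05, "f3": 0.05})
blocked = expected_cost(ROOT, PRIORS, {"f1": 0.0, "f2": 10.0, "f3": 10.0})
```

When the second-level measurements are made prohibitively expensive, the policy classifies immediately after f1 and the expected cost is 0.575 for these numbers; cheaper measurements can only lower it.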

Computationally, the decision tree formulation is less
burdensome than the sequential process described in [5],
where at each step, R(wk/x1,...,xn) must be computed for all
classes, wk. In the decision tree, the number of classes to
be considered increases from one (at a terminal node) to
M (i.e. the total number of classes), at the root.

Note that if all measurement costs, c(fi), are zero, the
quantity in (i) can never be the smallest, since more
measurements cannot increase the risk (assuming perfect
knowledge of p(x1,...,xN/wi)). Hence, if measurement costs are
zero, an optimum policy will only make the classification at
a terminal node.


4.1.2. Optimal Feature Ordering and Decision Policy

In this section, we consider the case where the tree
skeleton is given, but the features to be measured at the
nodes are not specified. Thus, the optimal policy must
specify not only the action to be taken, but also the best
feature to be measured next, in such a manner that the
average cost (measurement plus misclassification risk) is
minimized. Since the tree form is given, the classes at
each node that have not been rejected are known. Hence, for
a given set of values, xt1,xt2,...,xtn, of the features in the
set Ftn ⊂ FN, the total feature set, we must decide:

(i)  whether the sample should be classified into a class
that has not been rejected so far, or

(ii)  reject one set (left subtree's terminals) or the other
from consideration, and,

decide the feature to be measured at the next node.

Let,

Ftn = set of n features.

rn(xt1,xt2,...,xtn/Ftn) = minimum cost of making a
decision at node ln, given the particular observed
values of the particular feature set, Ftn.

R(wi/xt1,xt2,...,xtn,Ftn) = loss incurred by classifying
sample into class wi, given observations
on features in Ftn.


Then, rn() can be computed in a backward fashion using,

$$r_n(x_{t1}, \ldots, x_{tn}/F_{tn}) = \min \begin{cases} \min_{w_i \in W_n} \; R(w_i/x_{t1}, \ldots, x_{tn}, F_{tn}) \\ \min_{f_k \notin F_{tn}} \; c(f_k) + E\{\, r_{nl}(x_{t1}, \ldots, x_{tn}, x_k) \,\} \\ \min_{f_k \notin F_{tn}} \; c(f_k) + E\{\, r_{nr}(x_{t1}, \ldots, x_{tn}, x_k) \,\} \end{cases}$$

where rnl, rnr are the minimum costs at the left and
right sons of node ln.

For terminal node, ln,

$$r_n(x_{t1}, x_{t2}, \ldots, x_{tn}/F_{tn}) = R(w_k/x_{t1}, \ldots, x_{tn}, F_{tn})$$


The  computational  complexity  is  considerably  increased 
by  the  necessity  to  consider  all  subsets  of  n features  at 
any  node  n levels  below  the  root. 




4.1.3. Optimal Tree Structure

In many situations, one has no a priori information
regarding  a 'good'  tree  structure.  The  design  could  then  be 
split  into  two  phases.  In  the  first  phase,  the  optimal  tree 
structure  is  designed,  assuming  for  example,  that  a maximum 
likelihood  rule  is  used  at  every  node.  In  the  second  phase, 
the  procedure  of  Sec. 4. 1.1  can  be  used  to  determine  the 
optimal  decision  policy  at  every  node.  This  section 
describes  a 'bottom-up'  design  method  for  obtaining  the 
optimal  tree  structure.  Sufficient  conditions  for  the 
optimality  of  the  resulting  structure  are  also  discussed. 

Let,

W = total set of classes (=M),
Wk ⊂ W denotes a k-class subset of the classes,

T(W) = tree used to separate the classes in W.

The labels (feature names) used at the nodes complete
the tree description.

T*(W) = optimal tree on W, i.e. one that minimizes cost.
The 'cost' of a decision tree, r(T(W)), is given by,

$$r(T(W)) = \sum_{w_i \in W} P(w_i) \Bigl[\, \sum_{l_k \in S(w_i)} c(f_k) + \Bigl( 1 - \prod_{l_k \in S(w_i)} \bigl(1 - P_e(w_i/l_k)\bigr) \Bigr) \Bigr] \qquad (4.3)$$

where,

Pe(wi/lk) = probability of error at node lk if sample
is from class wi.
The following assumptions have been made in eqn. (4.3).

(i) the prob. of correct recognition of a random sample
from wi is a product of the prob. of correct
recognition at the nodes on the path to wi.

(ii) the measurement cost of misclassified samples is
negligible compared to that of correctly classified
samples.

A sufficient condition for (i) to be valid is that the
features on any path in the tree from the root to a terminal
node are statistically independent. Condition (ii) is valid
if the error rate of the tree is low.


The recursive definition of the optimal tree, T*(Wk), to
classify the k-class set Wk, is written as,

$$r(T^*(W_k)) = \min_{\forall T(W_k)} \bigl[\, r(T(W_k)) \,\bigr] = \min_{W_1, W_2} \Bigl\{ \min_{f_k} \Bigl[ \sum_{w_i \in W_k} P(w_i)\,\bigl( c(f_k) + P_e(w_i/l_k) \bigr) \Bigr] + r(T^*(W_1)) + r(T^*(W_2)) \Bigr\} \qquad (4.4)$$

where W1 ∪ W2 = Wk.


In the above formula, the cost of the tree, T(Wk), has
been written as the sum of the costs incurred at the top
node (the summation term), plus the costs of the optimal
trees for sets W1 and W2, viz., T*(W1) and T*(W2). In the
next section, we provide a justification for this assumption
of additive costs. Under these conditions, it is shown that
the recursive definition (4.4) results in the optimal
solution tree, T*(W). Note that in (4.4), the error rate for
wi at node lk, Pe(wi/lk), depends not only on the feature
used at lk, but also on the dichotomy of classes that is
performed (i.e. W1 versus W2).
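
The bottom-up recursion can be sketched as a memoized search over class subsets. The Python code below is a minimal illustration under assumptions: the priors, feature costs and state probabilities are invented, a maximum likelihood rule is assumed at every node (so that the prior-weighted node error for a dichotomy is the mass sent to the wrong side), features may be reused at different nodes, and the names node_cost and best_tree are hypothetical.

```python
from functools import lru_cache
from itertools import combinations

PRIORS = (0.4, 0.3, 0.3)                      # hypothetical class priors
# LIK[feature][class][state] = p(state | class); values are invented.
LIK = {
    "f1": ((0.9, 0.1), (0.2, 0.8), (0.1, 0.9)),
    "f2": ((0.5, 0.5), (0.9, 0.1), (0.1, 0.9)),
}
COST = {"f1": 0.01, "f2": 0.02}               # hypothetical measurement costs


def node_cost(feature, left, right):
    """Top-node term of the recursion for the dichotomy left vs right."""
    n_states = len(LIK[feature][0])
    err = 0.0
    for s in range(n_states):
        mass_l = sum(PRIORS[w] * LIK[feature][w][s] for w in left)
        mass_r = sum(PRIORS[w] * LIK[feature][w][s] for w in right)
        err += min(mass_l, mass_r)            # mass sent to the wrong side
    meas = COST[feature] * sum(PRIORS[w] for w in left + right)
    return meas + err


@lru_cache(maxsize=None)
def best_tree(classes):
    """Return (cost, tree) for the sorted tuple of class indices `classes`."""
    if len(classes) == 1:
        return 0.0, classes[0]
    best = None
    first, rest = classes[0], classes[1:]
    # Keep `first` on the left so that mirrored dichotomies are tried once.
    for k in range(len(rest)):
        for extra in combinations(rest, k):
            left = (first,) + extra
            right = tuple(w for w in rest if w not in extra)
            cost_l, tree_l = best_tree(left)
            cost_r, tree_r = best_tree(right)
            for f in LIK:
                total = node_cost(f, left, right) + cost_l + cost_r
                if best is None or total < best[0]:
                    best = (total, (f, tree_l, tree_r))
    return best


cost, tree = best_tree((0, 1, 2))
```

For these numbers the search selects f1 at the root to split class 0 from classes 1 and 2, and f2 below it, with a total cost of 0.212. The optimal subtrees for smaller subsets are computed once and reused, which is the saving the recursion is meant to provide.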

4.1.3.1. The Additive Cost Assumption

Assume that the features used at the nodes on a path in
the tree are statistically independent, and that the errors,
Pe(wi/lk), are small. Then, terms such as Pe(wi/lk).Pe(wi/lj)
can be ignored in comparison to the first order terms. If
this is done, the error rate of the tree, Pe(T), is given by,

$$P_e(T) = 1 - \sum_{w_i} P(w_i) \prod_{l_k \in S(w_i)} \bigl(1 - P_e(w_i/l_k)\bigr)$$

$$= \sum_{w_i} P(w_i) \Bigl( 1 - \bigl( 1 - \sum_{l_k \in S(w_i)} P_e(w_i/l_k) + \text{higher order terms} \bigr) \Bigr)$$

$$\approx \sum_{w_i} P(w_i) \sum_{l_k \in S(w_i)} P_e(w_i/l_k), \quad \text{ignoring higher order terms.}$$

Hence, the total error probability is roughly the sum of the
error rates for each class at all the nodes on the path to
that class. If we also assume that the measurement cost on
misclassified samples is negligible compared to the total
cost, we get,

$$r(T(W)) = \sum_{w_i \in W} P(w_i) \sum_{l_k \in S(w_i)} \bigl\{ c(f_k) + P_e(w_i/l_k) \bigr\} \qquad (4.5)$$

It follows immediately from (4.5) that the cost of a
decision tree is the sum of the cost incurred at the root
node plus the costs incurred in each subtree below the
root. Hence, we are justified in writing (4.4) as the sum of
the cost terms. Because the costs are additive,
minimizing each term would minimize the total cost.
Therefore, the recursive formula, (4.4), leads to the
globally optimal tree for the set of classes, W.
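
The size of the neglected higher order terms is easy to check for a single class. The short Python sketch below uses made-up node error rates and a hypothetical function name; it compares the exact path error, one minus the product of the node correct rates, with the first-order sum used in the additive cost.

```python
from math import prod


def exact_and_first_order(path_errors):
    """Return (exact, first-order) error for one class along its path."""
    exact = 1.0 - prod(1.0 - e for e in path_errors)
    first_order = sum(path_errors)
    return exact, first_order


# Hypothetical node error rates on a three-node path.
exact, approx = exact_and_first_order([0.02, 0.03, 0.01])
```

For these rates the exact error is 0.058906 against a first-order value of 0.06, so the additive form slightly overstates the error when the node error rates are small.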

The following theorem generalizes the conclusion reached
in the above discussion and gives sufficient conditions on
the criterion function, Pc(T), of tree goodness, under which
the dynamic programming algorithm for tree design described
in the preceding section will be optimal.

Theorem

Let Pc(T) be the goodness measure of a binary tree used
to separate a set of classes, W. Let l0 be the root of T, f0
the feature used at l0, and T0(l0) and T1(l0) the left and
right subtrees below node l0. Assume that Pc(T) can be
written as the sum of three terms as follows:

$$P_c(T) = P_c(T_0(l_0)) + P_c(T_1(l_0)) + G(W_0, W_1, f_0)$$

where the first two terms are the goodness measures of
the two subtrees T0(l0) and T1(l0), and the term G depends
ONLY on the feature f0 used at l0, and the sets of classes
W0 and W1 distinguished at l0, but NOT on the tree
structures T0 and T1. Then the algorithm of Sec. 4.1.3 finds
the optimal binary tree, T*, to separate the classes, W,
under the assumption that each class occurs at only one
terminal node of T*.

Proof:

Let W' be any subset of classes of the total set, W; F
be the total set of features. Let T*(W') be the optimal
binary tree (i.e. the one that maximizes Pc(T(W'))), under the
assumption that each node uses a single feature, and each
class occurs at only one terminal node of the tree. Then
under the assumption that Pc(T) can be written in the above
form, it follows that,

$$P_c(T^*(W')) = \max_{\forall T(W')} \bigl\{ P_c(T(W')) \bigr\}$$

$$= \max_{W_0, W_1, f} \;\; \max_{\forall T(W_0),\, T(W_1)} \bigl[\, P_c(T(W_0)) + P_c(T(W_1)) + G(W_0, W_1, f) \,\bigr]$$

where W0 ∪ W1 = W', f ∈ F, and T(W0) and T(W1)
are the subtrees of any tree T for classes
W', and have terminal classes as W0 and W1.

Since the G term does not depend on the trees T(W0),
T(W1), but only on the sets W0, W1 separated at the top node
of T(W'), and since the first two terms can be maximized
separately, we get,

$$P_c(T^*(W')) = \max_{W_0, W_1} \; \max_{f} \bigl[\, P_c(T^*(W_0)) + P_c(T^*(W_1)) + G(W_0, W_1, f) \,\bigr]$$

where W0 ∪ W1 = W' and f ∈ F.

The above equation is precisely the recursive form used
by the algorithm of Sec. 4.1.3, as it builds the optimal
tree T*(Wk) for k class subsets Wk of W. Since T*(Wk) is
optimal, it follows that T*(W), the final tree obtained,
must also be optimal.

It is seen that the distance measures discussed in
Sec. 3.2.2 satisfy the conditions of this theorem and hence
the same algorithm could be used to obtain an optimal tree
using the criterion given by equation (3.4) in that
section.

This chapter has shown the feasibility of using the
dynamic programming formulation for solving each of the
three phases of tree design. The solutions obtained are
optimal under fairly general assumptions. Thus, the optimal
decision policy and optimal feature ordering are obtained
without making any particular assumptions regarding the
feature distributions, such as statistical independence.
The optimality of the tree structure obtained via the
'bottom-up' procedure has been established under an additive
cost assumption.

In spite of these general conditions for optimality,
dynamic programming methods suffer from a high computational
complexity when applied to 'large' problems. In the context
of tree design, a large problem is one where the number of
classes to be distinguished is large, or the features from
which the selection of 'good' features is desired are
numerous. The next chapter investigates methods of reducing
this design complexity.




5. Methods Of Reducing The Computational Complexity

This chapter proposes various methods of reducing the computational
burden incurred in obtaining the optimal decision policy for each node
of a given tree structure and the optimal feature assignment for each
node of a given tree skeleton.

The reduction in computations incurred in determining the best
decision policy is achieved in the following ways:

(1) It is assumed that the node decisions are statistically
    independent.

(2) The decision at a node is computed only as a function of the
    feature measurement at that node.

(3) For non-metric features, the decision rules can be "clustered"
    and each cluster represented by a prototype. The DP procedure can
    then be used to search across all prototypes rather than all
    rules.

(4) Many decision rules can be discarded from consideration if they
    are "dominated" by others. Thus only the set of "efficient" rules
    needs to be searched. A branch-and-bound algorithm for finding the
    set of efficient rules is described.

Reduction in the complexity involved in getting the best assignment
of features to the nodes of a given tree skeleton is achieved as
follows:

(1) It is assumed that the decisions at the nodes are statistically
    independent;

(2) the choice of the feature to be measured at each node is fixed
    once the tree is designed, i.e. this choice is not a function of
    the feature values observed for a test sample;

(3) at each node, the features can be ranked and certain features
    discarded, thus reducing the search dimensionality.


5.1.  Decision  Policy  Design  Given  The  Tree 


A dynamic programming method for obtaining the optimal decision
policy was described in Chapter 4. In this bottom-up approach, the
risk was calculated at each node as a function of the past
measurements and the optimal risk from this point onwards to the
terminals below that node. Consider a path in the tree through nodes
{l1, l2, ..., lk} to classes w1, w2, as shown in Fig. 5.1 below.


In the DP approach, we compute r(x1, x2, ..., xk) as the minimum
misclassification risk at lk, for all sequences {x1, x2, ..., xk}.
Then we back up to lk-1 and choose the decision (go left or right) for
each possible subsequence, {x1, x2, ..., xk-1}, and so on up the tree.
To compute the decision, the entire past history of measurements and
the optimal decision sequence among all subsequent decision paths are
considered. In this way, the globally optimal policy is obtained.

At the other extreme, one could use only the last measurement to make
the decision. This corresponds to using a 'one-step' policy, as
described in Chapter 3. It was shown there that using a maximum
likelihood policy (OSMLR) to optimize each node's correct recognition
rate does not, in general, optimize the total tree's performance. This
is to be expected, since in the OSMLR method no use is being made of
the history of previous measurements, or the risk of subsequent
decision paths.

A question that arises is: between these two extremes, viz. DP, which
uses all the 'past' and projects into all possible 'future' decisions,
and the OSMLR, which uses only the 'present', can one steer a middle
course, and use some 'compact' summary of past measurements rather
than the measurements themselves? Such a strategy may do at least as
well as OSMLR, while simultaneously reducing the computation and
storage requirements of the DP algorithm.

One such 'compact' measure of previous observations, given that one
is at node lk, is the a posteriori likelihood of each class, given
that we have traversed nodes l1, l2, ..., lk-1. The a posteriori
likelihood is a function of the feature values measured at the
previous nodes, as well as the decision policy used at those points,
and it has a simple form when the features are statistically
independent. The following theorem proves that such a
'condensed-history decision rule' (abbreviated CHDR) performs at least
as well as an OSMLR policy, and is bounded above by the optimal
solution obtained by DP methods.




Theorem:

Assume that all features along the path of Fig. 5.1 are statistically
independent. Consider the following three ways of computing the
decision policy at node lk.

DP: If xk=s is observed, and
if $P(w_1)\,p(x_1, x_2, \ldots, x_{k-1}, s / w_1) > P(w_2)\,p(x_1, x_2, \ldots, x_{k-1}, s / w_2)$
then classify X in w1, else in w2.

OSMLR: If xk=s is observed, and
if $P(w_1)\,p(x_k = s / w_1) > P(w_2)\,p(x_k = s / w_2)$
then classify X in w1, else in w2.

CHDR: Let Pr(l1, l2, ..., lk-1/wi) be the probability that a sample
from wi will arrive at lk, and if xk=s is observed, and
if $P(w_1)\,p(x_k = s / w_1)\,Pr(l_1, \ldots, l_{k-1} / w_1) > P(w_2)\,p(x_k = s / w_2)\,Pr(l_1, \ldots, l_{k-1} / w_2)$
then classify X in w1, else in w2.

If Pc(DP), Pc(OSMLR), Pc(CHDR) are the average performance
(probability of correct recognition) of each rule when used at lk,
given the same decision rules at all previous nodes, then,

$$Pc(OSMLR) \le Pc(CHDR) \le Pc(DP)$$




Proof:

Let d1, d2, ..., dk-1 represent the decision vectors at nodes
l1, ..., lk-1, and X(d1), X(d2), ..., X(dk-1) the sets of samples that
would be passed down this path by each rule. Then the set arriving at
lk is,

$$X(d_1) \cap X(d_2) \cap \cdots \cap X(d_{k-1}) \equiv X(l_1, l_2, \ldots, l_{k-1})$$

The probability that a random sample from class wi will arrive at
node lk is given by,

$$Pr(l_1, l_2, \ldots, l_{k-1} / w_i) = \sum_{X \in X(l_1, \ldots, l_{k-1})} P(X / w_i)$$

The correct recognition rates of the three rules can be written as,

$$Pc(DP) = \sum_{x_k} \sum_{X} \max\{\, P(w_1)\,p(X/w_1)\,p(x_k/w_1),\; P(w_2)\,p(X/w_2)\,p(x_k/w_2) \,\}$$
where $X \in X(d_1, d_2, \ldots, d_{k-1})$,

$$Pc(OSMLR) = \sum_{x_k} P(X(d_1, d_2, \ldots, d_{k-1}) / w^*)\, \max\{\, P(w_1)\,p(x_k/w_1),\; P(w_2)\,p(x_k/w_2) \,\}$$
where $w^* = w_1$ if $P(w_1)\,p(x_k/w_1) \ge P(w_2)\,p(x_k/w_2)$, and $w^* = w_2$ otherwise,

$$Pc(CHDR) = \sum_{x_k} \max\{\, P(w_1)\,p(x_k/w_1)\,Pr(X(l_1, \ldots, l_{k-1})/w_1),\; P(w_2)\,p(x_k/w_2)\,Pr(X(l_1, \ldots, l_{k-1})/w_2) \,\}$$

From the algebraic inequality,

$$\sum_j \sum_i \max(a_i b_j,\; c_i d_j) \ge \sum_j \max\Big( b_j \sum_i a_i,\; d_j \sum_i c_i \Big)$$

and by equating ai, ci to p(X/w1), p(X/w2) respectively, $\sum_i a_i$,
$\sum_i c_i$ to Pr(X(l1, ..., lk-1)/w1) and Pr(X(l1, ..., lk-1)/w2),
and bj, dj to P(w1).p(xk/w1), P(w2).p(xk/w2), it follows that,

$$Pc(DP) \ge Pc(CHDR)$$




Similarly, by using the fact that,

$$\sum_j \max(a_1 b_j,\; a_2 c_j) \ge \sum_j a^* \max(b_j,\; c_j)$$
where $a^* = a_1$ if $b_j \ge c_j$, and $a^* = a_2$ otherwise,

it follows that,

$$Pc(CHDR) \ge Pc(OSMLR)$$
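
The ordering can also be checked numerically. The sketch below is an illustration under simplifying assumptions, not part of the proof: there are two classes, the whole history X is collapsed into one discrete variable with distribution `p_prev`, `passed` lists the history values that lead to lk, and all numbers and function names are hypothetical.

```python
import random


def recognition_rates(prior, p_prev, p_k, passed):
    """Return (Pc_OSMLR, Pc_CHDR, Pc_DP) at node lk for two classes."""
    # Probability that a sample of each class reaches lk.
    arrive = [sum(p_prev[c][x] for x in passed) for c in (0, 1)]
    pc_osmlr = pc_chdr = pc_dp = 0.0
    for s in range(len(p_k[0])):
        now = [prior[c] * p_k[c][s] for c in (0, 1)]
        # DP: decide separately for every (history, xk) pair.
        pc_dp += sum(max(now[c] * p_prev[c][x] for c in (0, 1)) for x in passed)
        # CHDR: decide from xk and the arrival probabilities only.
        pc_chdr += max(now[c] * arrive[c] for c in (0, 1))
        # OSMLR: decide from xk alone, then score with the arrival probability.
        winner = 0 if now[0] >= now[1] else 1
        pc_osmlr += now[winner] * arrive[winner]
    return pc_osmlr, pc_chdr, pc_dp


def random_dist(rng, n):
    """Random probability vector of length n (helper for experiments)."""
    w = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(w)
    return [x / total for x in w]


osmlr, chdr, dp = recognition_rates(
    prior=[0.5, 0.5],
    p_prev=[[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]],
    p_k=[[0.6, 0.4], [0.3, 0.7]],
    passed=[0, 1],
)
```

The returned values are probabilities of arriving at lk and being classified correctly there, so they need not sum to one across rules.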


This theorem suggests the use of the measure Pr(X(l1, ..., lk-1)/wi)
in finding the optimal policy at lk, viz., dk. If each feature has m
states, a decision vector di has m components, dij, j=1,2,...,m, where
dij=n implies that if fi is in state j, the n-th son of the current
node is to be traversed next. For statistically independent features,
and particular decisions d1, d2, ..., dk, one can write,

$$Pr(X(l_1, \ldots, l_{k-1}) / w_k) = \prod_i Pr(X(d_i) / w_k) = \prod_i \Big[ \sum_j p(x_i = j / w_k) \Big] = \prod_i g_k(d_i)$$

where gk(di) is the summation within the square brackets.

This sum is taken over all values of j such that dij=n, and the class
wk is under the n-th son of node li, e.g. in a binary tree,
$d_{ij} \in \{1, 2\}$, and in Fig. 5.1, for node lk and class w1, the
summation would be over all j such that dkj=1.

The history of previous measurements is thus retained in the g
functions, since their product represents the probability that a
sample from class wi will arrive at node lk. The DP method can be used
to evaluate the optimal one-step policy at each node, but unlike in
the Chapter 4 formulation, the decision di at each node is derived as
a function of the decision vectors used at the ancestor nodes of li,
rather than the actual measurements made at these nodes, xl1, xl2,
etc.
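
As a small illustration of the g functions, the sketch below computes g for one class at each ancestor node and multiplies the results to get the arrival probability. The function names and the numbers are hypothetical, and statistical independence of the features is assumed as in the text.

```python
from math import prod


def g(p_class, d, son):
    """Sum of p(x = j / w) over the states j that the rule d sends to `son`."""
    return sum(p for p, branch in zip(p_class, d) if branch == son)


def arrival_probability(p_by_node, decisions, sons):
    """Product of the g functions along the path (independent features)."""
    return prod(g(p, d, s) for p, d, s in zip(p_by_node, decisions, sons))


# One class, two ancestor nodes, three feature states per node.
p_by_node = [[0.5, 0.3, 0.2], [0.1, 0.6, 0.3]]   # p(x_i = j / w) at each node
decisions = [(1, 1, 2), (2, 1, 1)]               # d_ij = son chosen in state j
sons = [1, 1]                                    # the class lies under son 1

reach = arrival_probability(p_by_node, decisions, sons)   # about 0.8 * 0.9
```

Only the decision vectors enter the computation; the measurements actually observed at the ancestor nodes are not needed.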


5.1.1. Optimal One-Step Decision Policy


The decision tree of Fig. 5.2 shows the nodes, li, and the decision
vectors, represented by di, used in classifying a subset of classes,
viz. {w1, w2, w3, w4}. A dynamic programming procedure can be used to
derive the optimal 'one-step' policy at each node. For statistically
independent features, the correct recognition rates of the classes can
be written in terms of the g functions introduced in the last section.

$$Pc(w_1) = g_1(d_1)\,g_1(d_2)\,g_1(d_3)\,g_1(d_4)$$
$$Pc(w_2) = g_2(d_1)\,g_2(d_2)\,g_2(d_3)\,g_2(d_4)$$
$$Pc(w_3) = g_3(d_1)\,g_3(d_2)\,g_3(d_3)\,g_3(d_5)$$
$$Pc(w_4) = g_4(d_1)\,g_4(d_2)\,g_4(d_3)\,g_4(d_5)$$

The total tree performance is,

$$Pc(T) = \sum_i P(w_i)\,Pc(w_i) \qquad (5.1)$$




Assume each feature has m states. One can compute the optimal
decision rule d4* as a function of d1, d2, d3, as

$$d_{4j}^*(d_1, d_2, d_3) = 0 \quad \text{if } P(w_1)\,g_1(d_1)\,g_1(d_2)\,g_1(d_3)\,p(x_4 = j / w_1) > P(w_2)\,g_2(d_1)\,g_2(d_2)\,g_2(d_3)\,p(x_4 = j / w_2)$$
$$d_{4j}^*(d_1, d_2, d_3) = 1 \quad \text{otherwise,}$$
for j=1,2,...,m.
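
A minimal sketch of this rule for one terminal node is given below. It assumes the arrival probabilities (the products of the g functions) have already been computed for the two classes; the function name `one_step_rule` and the numbers are hypothetical.

```python
def one_step_rule(prior, arrive, p_x):
    """d[j] = 0 (toward w1) if the w1 term is strictly larger, else 1 (toward w2)."""
    d = []
    for j in range(len(p_x[0])):
        left = prior[0] * arrive[0] * p_x[0][j]
        right = prior[1] * arrive[1] * p_x[1][j]
        d.append(0 if left > right else 1)
    return d


p_x = [[0.2, 0.3, 0.5], [0.5, 0.3, 0.2]]    # p(x4 = j / w1), p(x4 = j / w2)

# Same likelihoods, different histories: the rule changes with the
# arrival probabilities even though the present measurement does not.
rule_a = one_step_rule([0.5, 0.5], [0.9, 0.3], p_x)
rule_b = one_step_rule([0.5, 0.5], [1.0, 1.0], p_x)
```

With equal arrival probabilities the rule reduces to comparing P(wi).p(x4=j/wi) alone, which is the OSMLR decision.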

This rule uses the probability of a class wi sample arriving at node
4, viz., gi(d1).gi(d2).gi(d3), and the observation x4=j, to make the
decision. Once d4* and d5* have been calculated for all d1, d2, d3,
one can back up to node 3, and compute d3* as a function of d1, d2, as
that value of d3 which maximizes the quantity,

$$\max_{d_3} \big[\, P(w_1)\,g_1(d_1)\,g_1(d_2)\,g_1(d_3)\,g_1(d_4^*) + P(w_2)\,g_2(d_1)\,g_2(d_2)\,g_2(d_3)\,g_2(d_4^*) + P(w_3)\,g_3(d_1)\,g_3(d_2)\,g_3(d_3)\,g_3(d_5^*) + P(w_4)\,g_4(d_1)\,g_4(d_2)\,g_4(d_3)\,g_4(d_5^*) \,\big] \qquad (5.2)$$

In the above sum, note that the values of d4, d5 substituted are
those obtained in the last step, viz. d4*(d1, d2, d3) and
d5*(d1, d2, d3). Working up the tree, one can finally compute the
optimal decision vector at l1, d1*. Once this is done, d2* is found
from the table of d2*(d1) computed during the backward procedure, d3*
from the table d3*(d1, d2), etc. The resulting decision rules are
optimal one-step rules, though the globally optimal rules can be
obtained, in general, only by using all the measurements already
observed, as described in Chapter 4.

While giving up optimality, the above procedure has resulted in a
considerable saving in storage as far as the FINAL decision tree is
concerned. The final tree has stored at each node an m-word decision
vector for the m possible states that can be observed at that point.
For the globally optimal solution, one would have to store at a node a
decision vector for every possible sequence of measurements that may
lead to that node. For a node n levels deep in the tree, and assuming
that approximately half (binary tree assumed) the decisions are to go
left/right, the storage of the decision policy at that node would
require the following number of words:


However, there is no saving in computing or storage cost in this
method, as compared to the general DP method of Chapter 4, since both
methods compute the optimal decision as a function of every possible
sequence of previous observations/decision vectors during the backward
procedure.

Two techniques for reducing this cost are described in the next
section. They use the g functions to transform the search domain from
that of the discrete non-metric space of the decision vectors, di, to
the metric space defined by the real-valued functions, gk(di).

5.1.2. Clustering in Decision Space

In the previous section, it was shown that dynamic programming can
be used to recursively specify the optimal decision vector at each
node of a hierarchical decision tree. When the features used at each
node are discrete and non-metric, taking on one of m states, the
decision vector for a binary tree is an m-dimensional binary vector.
Its i-th element, di, specifies whether to traverse the left or right
subtree below that node when the feature is in state i. Thus, at every
node, one must choose from among the set of $2^m$ possible vectors, d,
denoted by D. For a node n levels deep, the DP procedure would have to
evaluate $2^{mn}$ possibilities in order to compute the optimal policy
at that node. This 'mushrooming' of the computations as a function of
m and n makes the DP method impractical for all but very simple
problems.

This section presents a method of grouping the decision vectors,
$d \in D$, into sets D1, D2, ..., Dk, and choosing a compact
representation of each set, Di, by a prototype vector, ti. If the
decision space at the i-th node is partitioned into mi sets, and each
set represented by a single prototype, the DP procedure searches for
the optimal rule only from amongst the prototypes. Thus, the
computations needed to find the optimal policy at a node n levels deep
are reduced to,

$$\prod_{k=1}^{n} m_k$$

which is less than $2^{mn}$.
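
To give a feel for the two counts, the sketch below evaluates them for hypothetical sizes: m = 5 states, a node n = 3 levels deep, and an assumed 4 prototypes per node.

```python
from math import prod


def exhaustive_count(m, n):
    """Possibilities when all 2**m decision vectors are tried at each of n levels."""
    return (2 ** m) ** n


def prototype_count(prototypes_per_level):
    """Possibilities when only the prototypes are tried at each level."""
    return prod(prototypes_per_level)


full = exhaustive_count(5, 3)            # 32 ** 3
reduced = prototype_count([4, 4, 4])     # 4 ** 3
```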


The reduction in computations is gained at the expense of a
departure from optimality, since only the prototype vectors are
searched by the dynamic programming procedure. The prototype vector
represents more than one vector. This is achieved by specifying some
of its m bits to be 'don't care' bits rather than 0 or 1. The selected
don't care bits are such that varying them keeps the recognition rate
for all M classes within a suitable tolerance interval. Thus, two
vectors, d and d', from the same cluster differ in their performance
(for each of the M classes) only in the specified margin. The analog
of this clustering (grouping) concept, for the case of continuous
valued variables, is that of discretizing the variables for use in
dynamic programming.


The  next  section  describes  a similarity  measure  for 
decision  vectors.  This  measure  will  be  used  subsequently 
for  clustering  the  vectors. 

5.1.2.1. Similarity Measure For Decision Vectors

Two decision vectors which are candidates for use at a node l are
similar if exchanging one with the other does not change the tree
performance by a significant amount. This simple definition of
similarity can be used to cluster decision vectors which can be used
at l. Assume for simplicity that a binary decision is made at l. Also
let,

m: number of states that can be assumed by the feature used at l.

di: a binary decision vector of m components, dij, j=1,2,...,m.

D: the total set of decision vectors di, $i = 1, 2, \ldots, 2^m$,
   that can be used at node l.

Pc(wk/di): correct recognition rate for class wk if rule di is used
   at the node.

pk(j): probability that in class wk the feature is in state j.

Hence,

$$Pc(w_k/d_i) = \sum_{j:\, d_{ij} = 0} p_k(j) \quad \text{if } w_k \in W_0(l)$$
$$Pc(w_k/d_i) = \sum_{j:\, d_{ij} = 1} p_k(j) \quad \text{if } w_k \in W_1(l)$$

ti: a prototype vector whose m elements, tij, j=1,2,...,m, can have
   values '0', '1' or '-', i.e. the 'don't care' state. Thus ti can
   be used to represent a group of vectors {d}, e.g. if ti=(10-1-),
   it represents the set {10010, 10110, 10011, 10111}. Vectors such
   as ti will be used to represent clusters of rules satisfying a
   certain similarity criterion. This criterion is discussed next.
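
The way a prototype stands for a group of vectors can be shown with a short sketch; `expand_prototype` is a hypothetical helper name, and vectors are written as strings of '0', '1' and '-'.

```python
from itertools import product


def expand_prototype(t):
    """List every binary vector represented by a prototype such as '10-1-'."""
    slots = [("0", "1") if bit == "-" else (bit,) for bit in t]
    return ["".join(bits) for bits in product(*slots)]


members = expand_prototype("10-1-")
```

A prototype with n 'don't care' bits always expands to 2**n vectors.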


Consider the decision tree shown above in Fig. 5.3. If di denotes a
decision vector used at node li, then under the assumption that the
features used at the various nodes are statistically independent, the
average correct recognition rate of the tree, Pc(T), can be written as
a sum of products, as:

$$Pc(T) = \sum_{k=1}^{K} P(w_k) \prod_{i} Pc(w_k/d_i)$$

where the product is taken over the nodes li on the path from the root
to class wk.

To compare the similarity in the tree performance Pc(T) when two
alternative rules, d and d', are used at some node, li, one can define
an M-dimensional vector function, fi(d), which maps any decision rule,
d, used at li, into a point in M-space, given by:

$$f_i(d) = [\, Pc(w_1/d) \;\; Pc(w_2/d) \;\; \cdots \;\; Pc(w_M/d) \,]'$$

where it is assumed that w1, w2, ..., wM are the M classes below node
li in the tree.

A similarity criterion between the two decision rules, d and d',
could then be defined as the l1-norm,

$$\| f_i(d) - f_i(d') \| = \sum_{k=1}^{M} | Pc(w_k/d) - Pc(w_k/d') |$$


The change in Pc(T), denoted by $\Delta Pc(T)$, when d is replaced by
d' at node li, all other decision rules remaining unaltered, is,

$$\Delta Pc(T) = \sum_{j=1}^{M} P(w_j) \Big[ \prod_{k \ne i} Pc(w_j/d_k) \Big] \big( Pc(w_j/d) - Pc(w_j/d') \big)$$

Since $Pc(w_j/d_k) \le 1$, it follows that,

$$|\Delta Pc(T)| \le \sum_{j=1}^{M} P(w_j)\, | Pc(w_j/d) - Pc(w_j/d') | \le \| f_i(d) - f_i(d') \| \sum_{j=1}^{M} P(w_j)$$

Hence, the smaller the distance between the vectors d and d', as
measured by the l1-norm, the smaller the change in Pc(T) when rule d'
replaces d.

Other choices for the norm are the Euclidean norm or the infinity
norm. However, for the rest of this discussion, the l1-norm is assumed
to be the distance measure used.
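
The similarity measure can be sketched as follows for a binary node. The names `f`, `l1_distance` and `side` are hypothetical; `side[k]` is assumed to be 0 when class wk lies in W0 and 1 when it lies in W1, and the probabilities are made up for illustration.

```python
def f(d, p, side):
    """f(d)[k]: total probability of the states that d sends toward class k's branch."""
    return [sum(q for q, bit in zip(pk, d) if bit == s) for pk, s in zip(p, side)]


def l1_distance(a, b):
    """l1-norm of the difference between two performance vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


p = [[0.6, 0.3, 0.1],      # p1(j): class w1, assumed to lie in W0
     [0.2, 0.2, 0.6]]      # p2(j): class w2, assumed to lie in W1
side = [0, 1]

d = (0, 0, 1)
d_prime = (0, 1, 1)        # differs from d only in the second bit
distance = l1_distance(f(d, p, side), f(d_prime, p, side))
```

Here the two rules differ in a single bit, and the distance equals the sum of the class probabilities of that state, 0.3 + 0.2.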

Using the norm defined above, the clustering problem is that of
splitting the set of rules, D, into disjoint sets, C(1), C(2), ...,
C(N), such that for any set C(i),

(a) If d, d' are in C(i),
    $\| f(d) - f(d') \| \le r$, $r > 0$.
    Here, r is the spread of the cluster.

(b) any vector, $d \in D$, is in some cluster C(i), i=1,2,...,N.

One method of performing this clustering is to compute f(d) for all
$2^m$ vectors in D. These M-dimensional vectors can then be grouped
together in M-space using the criterion (a). However, such a method
requires the computation of $M \cdot 2^m$ functions, such as
Pc(wk/d), in addition to the large cost of storing the vectors
themselves.

5.1.2.2. Clustering Vectors in M-space

This section describes a simple method of clustering the decision
vectors at any node using the similarity measure defined above. The
resulting clusters are optimal in the sense that each cluster can be
described with a single prototype vector with the largest number of
'don't care' bits, '-'. One would like to partition the total set of
decisions, D, into sets, C(k), k=1,2,...,N, such that,

(1) Any vector, d, falls in at least one set C(k).

(2) For any two vectors d, d' in C(k), the distance,
    $\| f(d) - f(d') \| \le R_0$, $R_0 > 0$,

where,

$$f(d) = [\, Pc(w_1/d) \;\; Pc(w_2/d) \;\; \cdots \;\; Pc(w_M/d) \,]'$$
and $\| \cdot \|$ denotes the l1-norm introduced earlier.


Unlike data clustering in feature space, however, the criterion
sought to be optimized here is slightly different. In data clustering,
the criterion maximized is usually some ratio of between-cluster
scatter to within-cluster scatter, since the goal is to partition the
data into disjoint sets whose members are similar within a group and
sufficiently different from those in other groups. In the decision
clustering task, our goal is to minimize the NUMBER of clusters, where
members of a cluster satisfy property (2) above. The reason for this
is that the resulting set of prototypes will be used in the DP
procedure, and the computations needed in that algorithm increase
exponentially with the number of vectors (prototypes) to be
considered.

Consider the following simple method of generating the sets, C(k),
and their prototypes, denoted by tk. For each bit, i, i=1,2,...,m,
compute the figure of merit, S(i),

$$S(i) = \sum_{j=1}^{M} p_j(i)$$

where the sum is over all classes, j=1,2,...,M.

Order the m bits using their figure of merit, and select the 'don't
care' bits of the prototypes to be the n lowest merit bits such that,

$$R_0 \ge \sum_{i=1}^{n} S(k_i) \qquad (a)$$

where the ki are such that,

$$S(k_1) \le S(k_2) \le S(k_3) \le \cdots \le S(k_m)$$

and n is the largest integer such that equ. (a) holds.

The prototypes, tk, are those obtained by inserting all possible
combinations of 0/1 bits in the m-n bit positions which are not 'don't
care' positions. Hence the number of sets C(k) created is,

$$N = 2^{m-n}$$

Searching over the prototypes rather than the original vectors, d,
has thus resulted in a decrease in computation by a factor of $2^n$.
The next section defines the concept of an efficient decision rule,
which allows one to discard many of these sets C(k) and thus effect a
further reduction in the computational burden of the DP algorithm.
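
The selection of 'don't care' bits by figure of merit can be sketched as below. The function names and the probabilities are hypothetical, bit positions are numbered from 0, and the tolerance R0 is passed as `r0`.

```python
from itertools import product


def dont_care_bits(p, r0):
    """Pick the lowest-merit bits whose merits S(i) sum to at most r0."""
    m = len(p[0])
    merit = [sum(pj[i] for pj in p) for i in range(m)]   # S(i), summed over classes
    chosen, total = [], 0.0
    for i in sorted(range(m), key=lambda i: merit[i]):
        if total + merit[i] > r0:
            break
        total += merit[i]
        chosen.append(i)
    return chosen


def prototypes(m, chosen):
    """All prototypes with '-' at the chosen bits and 0/1 everywhere else."""
    free = [i for i in range(m) if i not in chosen]
    out = []
    for bits in product("01", repeat=len(free)):
        t = ["-"] * m
        for i, b in zip(free, bits):
            t[i] = b
        out.append("".join(t))
    return out


p = [[0.5, 0.05, 0.05, 0.4],     # class 1: p1(i) for the 4 states
     [0.1, 0.05, 0.8, 0.05]]     # class 2: p2(i)
chosen = dont_care_bits(p, r0=0.6)
protos = prototypes(4, chosen)
```

With these numbers two of the four bits become 'don't care' bits, so the 16 decision vectors are covered by 4 prototypes.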

5.1.3. Efficient Decision Strategies

Even after the clustering of decision rules, $d \in D$, applicable at
a tree node, one may still have a large set of possibilities to
examine. The concept of an "efficient" decision rule allows one to
reject a large number of rules, because it can be shown that the
optimal rules for the tree nodes (i.e. those that maximize the tree
performance) must also be efficient. Since in many applications the
set of efficient rules is significantly smaller than the set of all
rules, this property of the optimal decision policy can be used to
reduce the computations required in the dynamic programming algorithm
of Sec. 5.1.1 for finding the optimal "one-step" (OS) policy.

Before proceeding to describe a method of obtaining the set of
efficient rules for each tree node, some definitions are needed.

Definition: Let di and di' be two distinct decision rules used at a
tree node, li. Let f(di) denote the vector function of correct
recognition rates for the M classes below li, when rule di is used at
the node, i.e.

$$f(d_i) = [\, Pc(w_1/d_i) \;\; Pc(w_2/d_i) \;\; \cdots \;\; Pc(w_M/d_i) \,]'$$

assuming $W(l_i) = \{w_1, \ldots, w_M\}$.

Then rule di dominates di' if f(di) > f(di'), i.e.

$$Pc(w_j/d_i) \ge Pc(w_j/d_i') \quad \text{for all } w_j \in W(l_i)$$

and strict inequality holds for at least one class wj.

Definition: A decision rule, di, is efficient at node li if there is
no other rule di' which dominates it.
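
The two definitions translate directly into a brute-force search, sketched below for a binary node. The helper names are hypothetical, `side[k]` is assumed to be 0 for classes in W0 and 1 for classes in W1, and exact fractions are used so that the comparisons are not disturbed by rounding.

```python
from fractions import Fraction as F
from itertools import product


def rates(d, p, side):
    """f(d): per-class probability of being sent to the correct branch."""
    return tuple(sum(q for q, bit in zip(pk, d) if bit == s) for pk, s in zip(p, side))


def dominates(a, b):
    """a >= b in every component and a > b in at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def efficient_rules(p, side):
    """Rules whose performance vector is not dominated by any other rule's."""
    m = len(p[0])
    score = {d: rates(d, p, side) for d in product((0, 1), repeat=m)}
    return {d for d in score
            if not any(dominates(score[e], score[d]) for e in score if e != d)}


p = [[F(6, 10), F(3, 10), F(1, 10)],     # class w1, assumed to lie in W0
     [F(2, 10), F(2, 10), F(6, 10)]]     # class w2, assumed to lie in W1
efficient = efficient_rules(p, side=[0, 1])
```

In this two-class illustration only 4 of the 8 rules survive. The search compares every pair of rules, which is the 'brute-force' cost that grows as the square of 2**m.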




The following theorem establishes the fact that the optimal decision
rules at the tree nodes must be efficient.

Theorem 1: Let {li}, i=1,2,...,N, denote the nodes of a tree. Let Di,
for i=1,2,...,N, be the set of rules, di, applicable at the
corresponding nodes, li. Then the set of optimal rules, di*,
i=1,2,...,N, which maximize the tree recognition rate must be
efficient at the corresponding nodes, li, i=1,2,...,N.

Proof: Under the assumption of statistically independent node
decisions, the tree performance, Pc(T), is given by,

$$Pc(T) = \sum_{i} P(w_i) \prod_{k} Pc(w_i/d_k)$$

Consider some particular node, say lk, and assume that dk is not
efficient at lk. Then there exists some other rule, say dk', such
that,

$$Pc(w_i/d_k') \ge Pc(w_i/d_k) \quad \text{for all } w_i \in W(l_k),$$

and strict inequality holds for at least one class wj in W(lk). Hence
if all other decision rules are kept fixed and dk is replaced by dk',
then the value of Pc(T) can only increase, since in the equation for
Pc(T) given above, the sum of the product terms for classes under node
lk will increase. Hence dk can never be an optimal rule at lk, because
exchanging dk for dk' would increase the tree performance. Since the
choice of the node lk was arbitrary, the same argument holds for any
other node, and hence it follows that the optimal rule at every node
must also be an efficient rule.

The preceding definitions and theorem hold for any type of decision
rule, whether it uses discrete or continuous features to make the
branching decision at the node. However, the remaining discussion in
this section will be concerned with the case of discrete non-metric
features, each possessing m states, since in such cases the number of
rules in any set, Di, becomes large for large m, e.g. $|D_i| = 2^m$
for a binary decision tree. One can always find the set of efficient
rules at any node, li, by computing the function f(di) for all
$d_i \in D_i$, and finding the rules which are not dominated by any
other rules. However, for the discrete feature case, this
'brute-force' method is impractical for large m. In this section, we
propose the use of a branch-and-bound method for finding the set of
efficient rules.

This method consists of sequentially assigning bits of a prototype
vector, t, the values 0 or 1, and at each step, computing a lower
bound and upper bound on the vector function, f(d), for all vectors d
represented by that prototype. Whenever the upper bound function
becomes less than the function f(d*) for some known efficient rule,
d*, the entire set of rules represented by the prototype, t, can be
discarded as not being efficient, since they would be dominated by d*.
The lower bounding function is useful for finding a group of rules
that is guaranteed to contain a rule d which is NOT dominated by d*.
This is done by checking if any component of the lower bounding
function for t has a value greater than the corresponding component of
f(d*). If this is the case, it follows that none of the rules in that
set can be dominated by d*. Hence members of that set are potential
candidates for inclusion in the set of efficient strategies. Whenever
a set of rules is discarded as being not efficient, the algorithm
backtracks to the previous stage and tries a new bit assignment to the
prototype that it has not yet tried. The algorithm terminates when it
has found all sets of rules that are dominated by a given rule, d*.
During its execution it also keeps track of potential sets that
contain efficient rules. These rules can then be used to discard more
rules not already discarded. The algorithm is formally described below
for the case of a binary decision node, though it can be readily
extended for use at an M-way decision node, where M>2.


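The lower and upper bounds on the vector function f(d) for rules in
the cluster, D(t), are obtained in the following way. Let N0(d) and
N1(d) be the sets of indices, $j \in \{1, 2, \ldots, m\}$, such that
dj=0 or 1 respectively. These represent the states of the observed
feature for which the decision is to go left/right at that node. If
rule d is used at node l, and W0(l) and W1(l) are the sets of classes
distinguished at l, then for any class $w_i \in W_0 \cup W_1$, the
correct recognition rate at that node is,

$$Pc(w_i/d) = \sum_{j \in N_0(d)} p_i(j) \quad \text{if } w_i \in W_0(l),$$
$$Pc(w_i/d) = \sum_{j \in N_1(d)} p_i(j) \quad \text{if } w_i \in W_1(l).$$

Bounds on Pc(wi/d) for rules $d \in D(t)$ can then be obtained as
follows, where N0(t) and N1(t) denote the sets of bit positions at
which the prototype t has the value 0 or 1 respectively:

$$\max_{d \in D(t)} Pc(w_i/d) \le 1 - \sum_{j \in N_1(t)} p_i(j) \quad \text{if } w_i \in W_0(l),$$
$$\max_{d \in D(t)} Pc(w_i/d) \le 1 - \sum_{j \in N_0(t)} p_i(j) \quad \text{if } w_i \in W_1(l),$$
$$\min_{d \in D(t)} Pc(w_i/d) \ge \sum_{j \in N_0(t)} p_i(j) \quad \text{if } w_i \in W_0(l),$$
$$\min_{d \in D(t)} Pc(w_i/d) \ge \sum_{j \in N_1(t)} p_i(j) \quad \text{if } w_i \in W_1(l). \qquad (5.3)$$

Using equations (5.3), which bound each component of the vector
function f(d), the bounds on the vector f(d) can be defined as,

$$\min_{d \in D(t)} f(d) \ge l(t) = [\, l_1(t) \;\; l_2(t) \;\; \cdots \;\; l_M(t) \,]'$$
where,
$$l_i(t) = \sum_{j \in N_0(t)} p_i(j) \quad \text{if } w_i \in W_0(l), \qquad l_i(t) = \sum_{j \in N_1(t)} p_i(j) \quad \text{if } w_i \in W_1(l),$$

$$\max_{d \in D(t)} f(d) \le u(t) = [\, u_1(t) \;\; u_2(t) \;\; \cdots \;\; u_M(t) \,]'$$
where,
$$u_i(t) = 1 - \sum_{j \in N_1(t)} p_i(j) \quad \text{if } w_i \in W_0(l), \qquad u_i(t) = 1 - \sum_{j \in N_0(t)} p_i(j) \quad \text{if } w_i \in W_1(l).$$

The branch-and-bound algorithm which uses these vector bounds to
detect all rules dominated by a given rule, d*, is described next.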
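
The bounds can be sketched and checked against direct enumeration as below. The helper names are hypothetical, prototypes are written as strings of '0', '1' and '-', `side[i]` is assumed to be 0 for classes in W0 and 1 for classes in W1, and exact fractions are used so that equalities hold exactly.

```python
from fractions import Fraction as F
from itertools import product


def rates(d, p, side):
    """f(d) for a fully specified rule d given as a string of '0'/'1'."""
    return [sum(q for q, bit in zip(pi, d) if bit == str(s)) for pi, s in zip(p, side)]


def bounds(t, p, side):
    """l(t) and u(t) for a prototype t; '-' bits count toward neither sum."""
    lower, upper = [], []
    for pi, s in zip(p, side):
        own, other = str(s), str(1 - s)
        lower.append(sum(q for q, bit in zip(pi, t) if bit == own))
        upper.append(1 - sum(q for q, bit in zip(pi, t) if bit == other))
    return lower, upper


def expand(t):
    """All rules represented by the prototype t."""
    slots = [("0", "1") if bit == "-" else (bit,) for bit in t]
    return ["".join(bits) for bits in product(*slots)]


p = [[F(6, 10), F(3, 10), F(1, 10)],     # class assumed to lie in W0
     [F(2, 10), F(2, 10), F(6, 10)]]     # class assumed to lie in W1
side = [0, 1]
lower, upper = bounds("0-1", p, side)
inside = all(lo <= r <= hi
             for d in expand("0-1")
             for lo, r, hi in zip(lower, rates(d, p, side), upper))
```

For the all-'-' prototype the same function gives l(t) = (0, ..., 0) and u(t) = (1, ..., 1).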




Algorithm:

Variables used: d* is a rule assumed to be efficient at the start of
the algorithm. The program finds all rules dominated by d* and adds
them to a list, RL, of 'rejected rules'. It also finds all rules NOT
dominated by d* and puts them in list EL of possible efficient rules
for consideration later. t denotes a prototype initialized to all '-'
bits at the start. l(t) and u(t) are bounds on f(d) for rules in D(t).
The bits of t are set to 0 or 1 in a fixed order. At stage k, the k-th
bit of t is being changed. The program uses m flags, ci, i=1,...,m, to
keep track of the bit assignments of t that it has tried.

Initialize: Set t=(----..-), l(t)=(0,0,...,0), u(t)=(1,1,...,1),
RL = EL = $\emptyset$, the empty set, and stage variable, k=1.

Step 1: (Set k-th bit of t) If k=0 STOP, else, set ck=ck+1; if ck > 2,
go to step 4; if ck=2 set tk=1, if ck=1 set tk=0. Compute l(t) and
u(t) and continue.

Step 2: (Check if rule t is to be discarded) If f(d*) > u(t), put t in
RL and go to step 1, else continue.

Step 3: (Check if t dominates d*) If l(t) > f(d*), then put t in EL,
replace d* by any rule $d' \in D(t)$ and go to step 1; else check if
there is some component li(t) such that li(t) > fi(d*), and if there
is one, put t in EL and go to step 1; else if no such component is
found, go to step 5.

Step 4: (Backtrack to previous stage) Set ck=0, tk='-', and k=k-1 and
go to step 1.

Step 5: (Increment stage variable and check for last stage) If k=m put
t in EL and go to step 1; else set k=k+1 and go to step 1.
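
The steps can be sketched in Python as follows. This is one reading of the algorithm under stated assumptions, not the author's program: stages are numbered from 0, vector comparisons such as f(d*) > u(t) are taken to mean dominance, the rule d' that replaces d* is chosen by filling the '-' bits with '0', and the probability table is hypothetical (it is not the table of Example 1).

```python
from fractions import Fraction as F
from itertools import product


def rates(d, p, side):
    return [sum(q for q, b in zip(pi, d) if b == str(s)) for pi, s in zip(p, side)]


def bounds(t, p, side):
    lo = [sum(q for q, b in zip(pi, t) if b == str(s)) for pi, s in zip(p, side)]
    hi = [1 - sum(q for q, b in zip(pi, t) if b == str(1 - s)) for pi, s in zip(p, side)]
    return lo, hi


def dominates(a, b):
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def expand(t):
    slots = [("0", "1") if b == "-" else (b,) for b in t]
    return ["".join(bits) for bits in product(*slots)]


def branch_and_bound(d_star, p, side):
    m = len(d_star)
    t, c = ["-"] * m, [0] * m
    RL, EL = [], []
    f_star = rates(d_star, p, side)
    k = 0                                    # stage 1 of the text
    while k >= 0:                            # the text stops at k = 0
        c[k] += 1                            # Step 1
        if c[k] > 2:                         # Step 4: backtrack
            c[k], t[k] = 0, "-"
            k -= 1
            continue
        t[k] = "0" if c[k] == 1 else "1"
        lo, hi = bounds(t, p, side)
        name = "".join(t)
        if dominates(f_star, hi):            # Step 2: reject all of D(t)
            RL.append(name)
        elif dominates(lo, f_star):          # Step 3: t dominates d*
            EL.append(name)
            d_star = name.replace("-", "0")  # assumption: any d' in D(t) will do
            f_star = rates(d_star, p, side)
        elif any(l > f for l, f in zip(lo, f_star)):
            EL.append(name)                  # no rule in D(t) is dominated by d*
        elif k == m - 1:                     # Step 5: last stage reached
            EL.append(name)
        else:
            k += 1
    return RL, EL, d_star


tenths = [[5, 1, 1, 1, 2], [4, 1, 0, 1, 4], [1, 3, 3, 2, 1], [0, 2, 4, 3, 1]]
p = [[F(v, 10) for v in row] for row in tenths]    # hypothetical p_i(j)
side = [0, 0, 1, 1]                                # {w1, w2} versus {w3, w4}
RL, EL, final = branch_and_bound("01110", p, side)
```

Every rule ends up under exactly one prototype in RL or EL, so the two lists together account for all 2**m rules without listing them individually.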




Example 1 illustrates the execution sequence of the algorithm. The
class-conditional probabilities of the feature (5 states) are
tabulated below for each of the 4 classes. The classes {w1, w2} are to
be distinguished from {w3, w4} using this feature. The algorithm is
used to find all rules dominated by d*=(01110) and those prototypes
that could contain rules not dominated by it. Fig. 5.4 shows the
search tree expanded in the order indicated by the circled numbers.
The vector t and its bounds, l(t) and u(t), are shown at some nodes.
The circled terminal nodes indicate rejected rules, while those put in
list EL for examination later are marked separately.

Example 1:

The class-conditional probabilities of the feature are
shown in the table below.


Class    pi(1)    pi(2)    pi(3)    pi(4)    pi(5)
[Table entries for classes 1 to 4 are not legible in the source.]

Example 2:


The following example depicts the variation in the
algorithm's search efficiency (measured here by the number
of nodes expanded) as a function of the average performance
of the rule, d*. The table shows that as the rule d* becomes
'perfect' (i.e. error rate=0), the number of nodes expanded
in the search tree, as well as the sum of prototypes added to
the lists, RL and EL, decrease rapidly. This result is not
surprising, since the closer the value of each component of
f(d*) to unity, the quicker the search will detect rules
(prototypes) such that u(t) < f(d*). Hence a large number
of possibilities would get rejected rather quickly. The
example used to generate this table employed a feature with
6 states, which was used to separate three classes from two
other classes.


Fig. 5.4. Search Tree for Example 1.




5.2. Feature Selection at Tree Nodes

A method for optimally selecting the order of feature
measurements along every tree path was outlined in Chapter
4. However, that method involves a very high computational
and storage cost, even for moderately sized feature sets and
feature states. For example, if there are N features, each
taking on m states, then for a node n levels deep from the
root, the optimal feature to be measured must be tabulated
for all possible n-feature subsets, and all possible
measurements of these features. This requires,
approximately,

$$ \binom{N}{n}\, m^{n} \ \text{words.} $$
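
To give a feel for how quickly this table grows, the estimate can be evaluated directly. The function name and the sample values of N, m and n below are illustrative assumptions only.

```python
from math import comb

def lookup_table_words(num_features, num_states, depth):
    """Approximate storage, in words, for a node `depth` levels deep:
    (number of feature subsets of size depth) * (states ** depth)."""
    return comb(num_features, depth) * num_states ** depth

# Hypothetical case: N = 10 features, m = 4 states, node 3 levels deep.
words = lookup_table_words(num_features=10, num_states=4, depth=3)
```

For these values the count is 120 subsets times 64 measurement outcomes, i.e. 7680 words for a single node depth.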

To avoid this storage cost, one could take the opposite
approach of not taking into account the history of
measurements leading to a node, lk, but rather choose the
feature at a node on the basis of maximum node performance,
Pc(lk). It was shown in Chapter 3 that this way of selecting
features does not necessarily result in the optimal tree
performance. Analogous to the discussion on decision policy
design, one seeks to steer a middle course, viz., choosing
the feature at a node taking into account the features used
above and below it in the tree, while not taking into account
the actual measurements taken on the sample before it
arrived at lk. Two methods for obtaining the optimal (for
statistically independent features) solution to this problem
are given in the next two sections.


5.2.1. A Dynamic Programming Formulation


The bottom-up approach of dynamic programming can be
used to select the optimal feature to be used at each tree
node. Starting one level above the terminals, one computes
the best feature to be used at that node for all possible
feature assignments at nodes leading to that node. The best
feature choice is made on the assumption that a maximum
likelihood rule is used at each node (OSMLR) on this path,
and that the node decisions are statistically independent.
For a node n levels deep, one must therefore evaluate

$$ {}^{N}P_{n} = \binom{N}{n}\, n! $$

feature assignments, if there are N features.


Consider the tree of Fig. 5.5, and let Pc(wi/lk,f)
represent the correct recognition rate of a sample from wi
at node lk, if feature f is used there and a
maximum-likelihood policy employed. The total tree
performance can be written as a sum of products of the
Pc()'s (see equ. (3.1)), and the contribution to this
performance by the samples that filter down the path,
l1,l2,l3,l4, is given by,

$$ C(l_1,l_2,l_3,l_4) = \sum_{i} P(w_i)\, P_c(w_i/l_1,f_{l_1})\, P_c(w_i/l_2,f_{l_2})\, P_c(w_i/l_3,f_{l_3})\, P_c(w_i/l_4,f_{l_4}) $$

Thus, one can define the optimal feature to be measured
at l4, f4*, as a function of the features fl1, fl2, fl3, as
f4*(fl1,fl2,fl3), which maximizes C(l1,l2,l3,l4) when
fl1, fl2, fl3 are used at l1, l2, l3, respectively. Similarly,
f5*(fl1,fl2,fl3) can be computed. Working up one level in
the tree, the optimal feature, f3*(fl1,fl2), is computed as
that which maximizes the contribution of the path, l1,l2,l3,
viz. C(l1,l2,l3), when fl1, fl2, fl3 are used at l1, l2, l3, and
f4*(fl1,fl2,fl3) and f5*(fl1,fl2,fl3) are used at l4, l5. In
this way f1* is computed as the best feature to measure
first. By table look-up in the tables f2*(fl1),
f3*(fl1,fl2), etc., successively, all the features can be
determined. The tables f4*(), f5*() are optimal, since all
possible features are considered at these nodes, for a given
feature usage on the path l1,l2,l3. The optimality of the
above algorithm then follows by induction on successively
higher level tree nodes, and from the fact that the
contribution of each subtree below a node, lk, to the
goodness measure, C(l1,...,lk), is additive. Since the
maximum value of C(l1) is the optimal tree performance, the
corresponding feature assignment to the nodes is also
optimal.
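
The first stage of this bottom-up procedure (tabulating the best feature at the deepest node for every feature choice above it) can be sketched as follows. This is a minimal illustration on a single path, not the full recursion over all tree levels; the Pc values, the two-class setup and the function names are hypothetical.

```python
from itertools import product

def path_contribution(priors, pc, features):
    """C(l1,..,ln) = sum_i P(wi) * prod_k Pc(wi / lk, f_lk),
    where pc[node][feature][class] is the node-level recognition rate."""
    total = 0.0
    for i, prior in enumerate(priors):
        term = prior
        for node, f in enumerate(features):
            term *= pc[node][f][i]
        total += term
    return total

def last_node_table(priors, pc, num_features):
    """For every feature choice at the nodes above, tabulate the feature
    at the deepest node that maximizes the path contribution."""
    depth = len(pc)
    table = {}
    for upper in product(range(num_features), repeat=depth - 1):
        table[upper] = max(
            range(num_features),
            key=lambda f: path_contribution(priors, pc, upper + (f,)),
        )
    return table

# Hypothetical data: 2 classes, a path of 2 nodes, 2 candidate features.
PRIORS = [0.5, 0.5]
PC = [
    [[0.90, 0.80], [0.70, 0.95]],   # node l1: feature 0, feature 1
    [[0.60, 0.90], [0.85, 0.70]],   # node l2: feature 0, feature 1
]
TABLE = last_node_table(PRIORS, PC, num_features=2)
```

In this toy case the best feature at the lower node changes with the feature used above it, which is the reason the table is indexed by the upstream assignment rather than computed once per node.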


5.2.2. A Branch and Bound Formulation


In this section, we describe an alternative method for
obtaining the optimal feature assignment for a given tree
skeleton. This procedure falls in the category of
branch-and-bound methods, which have found wide use in many
combinatorial optimization problems. To the best of our
knowledge, the use of this method for selecting the optimal
features for a tree skeleton is new.

The branch and bound method assigns features to the tree
nodes in a sequential (top-down) manner. Whenever the
assignment of a feature reduces the optimality criterion
(e.g. correct recognition rate) below a lower bound, that
assignment sequence is abandoned. The algorithm backtracks
to the previous stage (node) and tries a new sequence. At
the conclusion, the resulting sequence is the optimal
feature assignment. The following notation is used in the
discussion below:


F = total set of N features.
flk = feature assigned to node lk.

Pc(wi/flk) = prob. of correctly classifying a sample from
wi into W0(lk) (or W1(lk)), using flk and a
maximum likelihood rule.

Ln = (l1,l2,..ln) = an ordering of the nodes, such that,
if lj precedes li on a path from the root to a terminal
then j < i.

Lm = (l1,l2,..lm) = a prefix of length m of Ln, representing
a subgraph Tm of tree T.

Pc(fl1,fl2,..flm) = correct recognition rate of Tm, using
fl1,fl2,.. at the m nodes of the subtree Tm. This tree
classifies a sample into m+1 sets. If m=n, the sets
are single classes, else they are unions of classes.

Pc*(m) = max recognition rate of Tm

$$ P_c^*(m) = \max_{f_{l_1},\ldots,f_{l_m} \in F} P_c(f_{l_1},f_{l_2},\ldots,f_{l_m}) = \max_{f_{l_1},\ldots,f_{l_m} \in F} \sum_{i} \Big[ P(w_i) \prod_{k} P_c(w_i/f_{l_k}) \Big] $$


Branch-and-Bound Algorithm.

Variables used: tij = array of flags, in which tij=1
denotes that fi has been assigned to node lj.
B = lower bound on the optimal recognition rate of T,
Pc*(n).

Step 0 (Initialize): Set tij=0, i=1,2,..m, j=1,2,..n.
Choose an initial assignment of features, and set B
to the recognition rate for that choice.
Set k=0 (stage=0), and L0 = ∅, the null sequence.

Step 1 (Choose next feature): Compute Lk from Lk-1,
as Lk = (Lk-1,fa), where Pc(Lk-1,fa) = Max Pc(Lk-1,f),
the maximum being taken over f.

Step 2 (Test against lower bound):
If Pc(Lk-1,fa) < B go to step 4 else go to step 3.

Step 3 (Check for last stage): If k=n go to step 6
else set tak=1, k=k+1, and go to step 1.

Step 4 (Backtrack): Set tjk=0, j=1,2...m, k=k-1.
If k=0 STOP, else go to step 5.

Step 5 (Seek new feature at previous stage):
If tjk=1, j=1,2...m go to step 4 else go to step 1.

Step 6 (Update lower bound):
Set B=Pc(Ln), and record Ln, the optimum assignment so
far, then go to step 4.
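
The flag-based steps above can be paraphrased as a recursive depth-first search. The sketch below is an interpretation under stated assumptions, not a transcription of the algorithm: the bound starts at -1 instead of the rate of an initial assignment, any feature may be used at any node, and the partial rate of a prefix treats nodes not yet assigned as perfect (so partial rates can only decrease as nodes are added). All names and the Pc table are hypothetical.

```python
from itertools import product

def prefix_rate(prefix, priors, below, pc):
    """Recognition rate of the partial tree whose first len(prefix) nodes
    use the features in `prefix`; below[k] is the set of classes under
    node k, and pc[k][f][i] is Pc(wi / f at node k)."""
    total = 0.0
    for i, prior in enumerate(priors):
        term = prior
        for k, f in enumerate(prefix):
            if i in below[k]:
                term *= pc[k][f][i]
        total += term
    return total

def branch_and_bound(priors, below, pc, num_features):
    """Depth-first search that abandons a prefix once its rate cannot
    exceed the best complete assignment found so far (the bound B)."""
    num_nodes = len(below)
    best = {"rate": -1.0, "seq": None}

    def extend(prefix):
        if len(prefix) == num_nodes:
            rate = prefix_rate(prefix, priors, below, pc)
            if rate > best["rate"]:
                best["rate"], best["seq"] = rate, tuple(prefix)
            return
        # Try the most promising feature first, as in Step 1.
        scored = sorted(((prefix_rate(prefix + [f], priors, below, pc), f)
                         for f in range(num_features)), reverse=True)
        for rate, f in scored:
            if rate <= best["rate"]:
                break          # remaining features are no better: backtrack
            extend(prefix + [f])

    extend([])
    return best["rate"], best["seq"]

def brute_force(priors, below, pc, num_features):
    """Exhaustive reference used only to check the search."""
    return max(prefix_rate(list(s), priors, below, pc)
               for s in product(range(num_features), repeat=len(below)))

# Hypothetical 4-class tree: root l1 splits {w1,w2} from {w3,w4}.
PRIORS = [0.25, 0.25, 0.25, 0.25]
BELOW = [{0, 1, 2, 3}, {0, 1}, {2, 3}]
PC = [
    [[0.9, 0.8, 0.7, 0.6], [0.6, 0.7, 0.9, 0.9], [0.8, 0.8, 0.8, 0.8]],
    [[0.9, 0.5, 1.0, 1.0], [0.7, 0.8, 1.0, 1.0], [0.6, 0.9, 1.0, 1.0]],
    [[1.0, 1.0, 0.6, 0.9], [1.0, 1.0, 0.85, 0.8], [1.0, 1.0, 0.9, 0.5]],
]
RATE, SEQ = branch_and_bound(PRIORS, BELOW, PC, num_features=3)
```

Pruning is safe here because every factor Pc is at most 1, so no extension of a prefix can score higher than the prefix itself.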

The  above  algorithm  satisfies  the  following  properties. 




Theorem 1: Let Pc*(n) be the optimal recognition rate
achievable. Then, the algorithm evaluates every sequence,
fl1,fl2,..flk, which has the property that
Pc(fl1,fl2,..flk) ≥ Pc*(n), for 1 ≤ k ≤ n.

Proof: Pc(fl1,fl2,..flk) ≥ Pc*(n) ≥ B, for any value of
B that occurs in the execution of the algorithm. Hence, this
feature sequence will be evaluated, since the algorithm
evaluates all sequences whose Pc value exceeds the lower
bound, B, at that time.

From Theorem 1, one obtains,

Theorem 2: If in the algorithm, we keep track of the
maximum value of Pc(k) at the k-th stage, then at the
conclusion of the algorithm, this value is equal to Pc*(k),
k=1,2...n. Hence, one also obtains the solution to the
optimal assignment problem for partial classification trees
{Tm}, m=1,2,..n.

Proof: From property (i) stated earlier,
Pc*(k) ≥ Pc*(n) for k ≤ n. Hence,
Pc*(k) ≥ B, any lower bound on Pc*(n).
At the k-th stage, we examine all sequences Lk such
that Pc(Lk) ≥ B, and hence, ALL Lk such that Pc(Lk) ≥
Pc*(n). If Lk* is the optimal k-sequence, i.e.
Pc(Lk*)=Pc*(k), then this sequence also satisfies Pc(Lk*) ≥
Pc*(n) ≥ B. Hence, it will be evaluated, and by keeping
track of the maximum value of Pc(), we obtain Pc*(k) when
the algorithm terminates.


5.2.3.  Feature  Ranking 

The last two sections proposed alternative methods of
obtaining the optimal feature assignment for a given tree
skeleton under the assumption that an OSMLR rule is used at
each node. The number of features to be considered at a
particular node during this search for the optimal
assignment can often be reduced by feature ranking. For
each feature in the total set of features, one could compute
the vector of correct recognition rates for the classes
under a given node. Thus one can define a goodness measure,
g(f,li), for all features f ∈ F at node li, as,

g(f,li) = [ Pc(w1/li,f)  Pc(w2/li,f)  ... ]

where w1, w2,... are classes in W(li), and each
component of the vector measures the probability of correct
decision on a sample from that class at that node, using
feature f and an OSMLR rule. Analogous to the concept of
efficient decision rules (Sec. 5.1.3), one can then define
the notion of dominance among features and that of efficient
features at each node.

Definition: A feature f dominates feature f' at node li
if g(f,li) > g(f',li), i.e.
Pc(wj/f,li) ≥ Pc(wj/f',li) for all wj ∈ W(li), with strict
inequality holding for at least one class, wj.

Definition: A feature f is efficient at node li if it is
not dominated by any other feature, f' ≠ f, at that node.

It can be easily shown that the optimal features for a
given tree skeleton must be efficient. The proof is
identical to that employed for efficient decision rules in
Sec. 5.1.3 and is therefore omitted here.
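
The two definitions translate directly into a filter that keeps only the efficient features at a node. The sketch below assumes the goodness vectors g(f,li) have already been computed; the feature names and vector values are hypothetical.

```python
def dominates(ga, gb):
    """True if ga >= gb componentwise, with strict inequality somewhere."""
    return (all(a >= b for a, b in zip(ga, gb))
            and any(a > b for a, b in zip(ga, gb)))

def efficient_features(goodness):
    """goodness maps a feature name to its vector g(f, li); a feature is
    kept if no other feature at the node dominates it."""
    return [f for f, g in goodness.items()
            if not any(dominates(other, g)
                       for name, other in goodness.items() if name != f)]

# Hypothetical goodness vectors for a node with two classes below it.
G = {
    "f1": (0.9, 0.6),
    "f2": (0.7, 0.8),
    "f3": (0.6, 0.5),   # dominated by both f1 and f2
    "f4": (0.9, 0.5),   # dominated by f1 (equal on w1, worse on w2)
}
EFFICIENT = efficient_features(G)
```

Only the features that survive this filter need to be tried at the node during the dynamic programming or branch-and-bound search.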




6. Estimation Of Decision Rules From Finite Samples

In the previous chapters, we have discussed optimization
methods for the design of hierarchical classifiers, when the
criterion of optimality is a weighted sum of measurement
cost and misclassification risk. These methods assumed that
the required parameters, such as the class-conditional
probabilities of the features used, were either known or
could be estimated with a high degree of confidence from
training samples. The classifier design was therefore not
influenced by the number of samples in the design set.
However, when the sample size is 'small', the degree of
confidence in the estimated parameters is low, and one would
like to use the samples as efficiently as possible. A
classification scheme that requires too large a set of
parameters to be estimated could in such cases perform
worse on an independent test set than a scheme which requires
fewer parameters.

The number of parameters required is a function of the
number of features used at each node, the number of
states (levels) of each feature (if the features are
discrete), and the decision complexity at each node.
Decision complexity refers to the number of sets (of
classes) distinguished at a node (e.g. this is two for a
binary tree). As an example, consider the following two
tree structures for a 6-class problem.




Assume that at each tree node, a maximum likelihood rule
is used to distinguish the sets of classes. In T1, the top
node uses feature f1 to separate the two mixtures,
{w1,w2,w3} and {w4,w5,w6}, while at its lower level nodes,
there is a 3-way decision made using the observations of
features f2 and f3. In T2 each node decision is binary.
Assuming that each feature has m states and the features are
class-conditionally statistically independent, the number
of parameters required for trees T1 and T2 are,

Np(T1) = 2.m + 2.3.2.m = 14.m

Np(T2) = 2.m + 2.2.m + 2.2.m = 10.m
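
These counts can be reproduced by summing, over the nodes of each tree, the number of sets distinguished times the number of features used times m. That reading of the two formulas, along with the node lists and function name below, is an assumption made for illustration.

```python
def num_parameters(nodes, m):
    """nodes: list of (sets_distinguished, features_used) per tree node.
    Each node needs one probability per set, per feature, per state."""
    return sum(sets * feats * m for sets, feats in nodes)

# T1: a binary top node using f1, then two 3-way nodes using f2 and f3.
T1 = [(2, 1), (3, 2), (3, 2)]
# T2: five binary nodes, each using a single feature.
T2 = [(2, 1), (2, 1), (2, 1), (2, 1), (2, 1)]

m = 4  # hypothetical number of states per feature
np_t1 = num_parameters(T1, m)
np_t2 = num_parameters(T2, m)
```

For m = 4 this gives 56 parameters for T1 and 40 for T2, in the 14:10 ratio stated above.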

The tree, T2, therefore requires fewer parameters than
T1. In theory (i.e. assuming perfect knowledge of
probabilities) T1 should perform better than T2, since the
classes {w1,w2,w3} are being separated in T1 using features
f2 and f3, while in T2 this subproblem is solved in two
stages, and the discrimination between w2 and w3 makes no
use of information in f2. However, since T2 requires fewer
parameters than T1, for small sample sizes, one can expect
that the estimated maximum likelihood rules at the nodes of
T2 could be designed with greater confidence than those of T1.
Hence the mean accuracy of T2 could be better than that of
T1 for small design sample size. By mean accuracy, we mean
the average performance of the particular classifier (T1 or
T2), designed using say N samples per class, on an
independent test set.

This chapter discusses some of the quantitative aspects
of the relationship between sample size, complexity, and
classifier performance, for the case when the features are
discrete and class-conditionally statistically independent,
and under the assumption that a maximum likelihood rule is
used at each tree node of a hierarchical classifier. This
analysis is carried out in the following phases.

First, it is shown that for "most" probability
structures, the variance of the estimated parameters of a
mixture of n populations (classes) is less than the sum of
the variances of the estimates of the individual class
probability functions, assuming that the number of samples
available is proportional to the apriori probability of the
classes. This result provides a heuristic argument for
stating that the confidence in estimating a maximum
likelihood rule for splitting a set of classes into two
sets would be greater than the confidence in estimating a
rule that, in a single step, splits a set of classes into
the component classes.

We then derive an expression for the mean accuracy of a
decision tree used for an M-class problem. The mean accuracy
is defined as the performance of an estimated (using a
finite sample) hierarchical classifier, averaged over the
'space of all problems'. It is assumed that all problems
are equally likely. Though the resulting expression is too
general to enable any precise design criteria to be derived,
it provides a result which is qualitatively significant. It
is shown that for a given sample size, there exists an
optimal quantization complexity, i.e. by increasing the
number of quantization levels indefinitely, one cannot
achieve increasingly better performance. While this result
has been reported by others [33,34], our analysis extends
the result to a multiclass (more than two classes) problem.

The above result establishes a relationship between
optimal quantization complexity and sample size. The last
section in this chapter investigates the relation between
decision complexity and sample size. It shows that for small
sample sizes, it may be better to split a multiclass problem
hierarchically into simpler problems (i.e. into problems
with a smaller complexity) than to design a single-step
classifier which uses all features together to make an M-way
decision.

6.1.  Estimation  Of  Discrete  Probabilities  Of  Mixtures 

Assume that one wants to estimate the probability
distributions of samples from M classes, wi, i=1,2,...,M.
For simplicity, each sample is regarded as a single feature
observation, where the feature can have one of m
values (states). Let pi(j), for j=1,...m, i=1,2,..M, denote
the probability that the feature will be in state j given
that it came from class wi. The maximum likelihood estimate
of pi(j), written as p̂i(j), is given by,

$$ \hat{p}_i(j) = n_i(j) / N_i $$

where Ni is the number of labelled design samples from
wi, and ni(j) is the number of those samples that were in
state j. Assuming that the samples were independent and
identically distributed, the marginal probability of
obtaining a particular value of ni(j) is given by a binomial
distribution, viz.,

$$ \Pr[n_i(j)] = b(N_i, n_i(j), p_i(j)) = \binom{N_i}{n_i(j)}\, [p_i(j)]^{n_i(j)}\, [1-p_i(j)]^{N_i - n_i(j)} $$

The mean and variance of ni(j) and p̂i(j) are given by,

$$ E[n_i(j)] = N_i\, p_i(j), \qquad E[\hat{p}_i(j)] = p_i(j) $$

$$ \mathrm{Var}[n_i(j)] = N_i\, p_i(j)\, [1 - p_i(j)] $$

$$ \mathrm{Var}[\hat{p}_i(j)] = p_i(j)\, [1 - p_i(j)] / N_i \qquad (6.1) $$
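
As a quick numerical check of (6.1), the moments of the estimate can be computed exactly from the binomial distribution of the counts and compared with the closed forms. The values p = 0.3 and N = 20 and the function name are illustrative assumptions.

```python
from math import comb

def estimate_moments(p, n_samples):
    """Exact mean and variance of n/N when n ~ Binomial(N, p)."""
    pmf = [comb(n_samples, n) * p**n * (1 - p) ** (n_samples - n)
           for n in range(n_samples + 1)]
    mean = sum(w * n / n_samples for n, w in enumerate(pmf))
    var = sum(w * (n / n_samples - mean) ** 2 for n, w in enumerate(pmf))
    return mean, var

mean, var = estimate_moments(p=0.3, n_samples=20)
closed_form_var = 0.3 * (1 - 0.3) / 20   # p(1-p)/N from (6.1)
```

The computed mean agrees with p and the computed variance agrees with p(1-p)/N up to floating-point rounding.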


Consider the problem of estimating the state
probabilities assuming a sample came from the mixture of
populations of w1,..wM, with known mixing weights, P(wi),
i=1,2,..M. Let p(j) denote the probability of observing
state j. There are two ways of estimating the parameters,
p(j), j=1,2...m.

In the first scheme, assume that one could choose Ni
independent samples from each of the populations, wi,
i=1,2..M. Then a maximum likelihood estimate of p(j) is
given by,

$$ \hat{p}(j) = \sum_{i=1}^{M} P(w_i)\, n_i(j) / N_i $$

where ni(j) has the connotation defined earlier.
From equation (6.1), and the independence of the samples
drawn from different classes, it follows that,

$$ E[\hat{p}(j)] = \sum_{i=1}^{M} P(w_i)\, E[\hat{p}_i(j)] $$

$$ \mathrm{Var}[\hat{p}(j)] = \sum_{i=1}^{M} P(w_i)^2\, \mathrm{Var}[\hat{p}_i(j)] < \sum_{i=1}^{M} \mathrm{Var}[\hat{p}_i(j)], \quad \text{since } P(w_i) \le 1.0 $$

For the case when all classes have the same apriori
probability, P(wi), the above inequality reduces to,

$$ \mathrm{Var}[\hat{p}(j)] < \frac{1}{M} \max_i \{ \mathrm{Var}[\hat{p}_i(j)] \} \qquad (6.2) $$

The  above  analysis  shows  that  the  variance  of  the 
mixture  parameters  in  such  an  estimation  scheme,  is  less 
than  the  sum  of  individual  class  parameter  variances,  and 
for  equal  apriori  class  probabilities,  it  is  less  than  1/  M 
times  the  maximum  variance  of  the  class  parameters. 

An alternative method of estimating p(j) might consist
of taking N independent samples from the mixture population.
The probability that in a random sample so selected, the
feature will be in state j, is given by,

$$ p(j) = \sum_{i=1}^{M} P(w_i)\, p_i(j) \qquad (6.3) $$

If n(j) denotes the number of samples out of N that were
in state j, then a maximum likelihood estimate of p(j) is,

$$ \hat{p}(j) = n(j)/N . $$

Since each sample has a probability p(j) of being in
state j, the marginal distribution of n(j) is binomial, and
given by,

Pr[n(j)] = b(N,n(j),p(j)), where p(j) is given by
(6.3).

For the case of equal apriori class probabilities,
P(wi), the expectation and variance of p̂(j) are,

$$ E[\hat{p}(j)] = p(j) = \frac{1}{M} \sum_{i=1}^{M} p_i(j) $$

$$ \mathrm{Var}[\hat{p}(j)] = p(j)\,[1-p(j)] / N = \frac{1}{M^2 N} \sum_{i=1}^{M} p_i(j) \Big[ M - \sum_{i=1}^{M} p_i(j) \Big] $$


Denoting by xi the quantity pi(j), it follows from the
above equation and (6.1), that,

$$ \mathrm{Var}[\hat{p}(j)] \le \sum_{i=1}^{M} \mathrm{Var}[\hat{p}_i(j)], $$

assuming that N/M samples were available to estimate each of
the pi(j), if the following condition holds:

$$ \frac{1}{M^2} \Big( \sum_{i} x_i \Big) \Big( M - \sum_{i} x_i \Big) \le M \sum_{j} x_j (1 - x_j) $$

The above inequality reduces to,

$$ (M^3 - M) \sum_{j} x_j + \sum_{j \ne k} x_j x_k - (M^3 - 1) \sum_{j} x_j^2 \ge 0 \qquad (6.4) $$

Equation (6.4), taken with equality, is an ellipsoidal
surface in (M-1) dimensional space that passes through the
origin and the point, (1,1,..1), and has an intercept on each
axis, xj, given by,

$$ x_j = (M^3 - M) / (M^3 - 1) . $$
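
The step from the variance condition to (6.4) can be spelled out as follows. This is a worked sketch using S as shorthand for the sum of the x_j, a symbol introduced here only for the derivation.

```latex
% S = \sum_j x_j is shorthand introduced for this derivation only.
\begin{align*}
\frac{1}{M^2}\, S\,(M - S) &\le M \sum_j x_j (1 - x_j)
   && \text{(condition on the two variances)} \\
M S - S^2 &\le M^3 S - M^3 \sum_j x_j^2
   && \text{(multiply by } M^2\text{)} \\
0 &\le (M^3 - M)\, S + S^2 - M^3 \sum_j x_j^2 \\
0 &\le (M^3 - M) \sum_j x_j + \sum_{j \ne k} x_j x_k - (M^3 - 1) \sum_j x_j^2
   && \text{(since } S^2 = \textstyle\sum_j x_j^2 + \sum_{j \ne k} x_j x_k\text{)}
\end{align*}
% Axis intercept: set all x_k = 0 except x_j = x, giving
% (M^3 - M)\,x - (M^3 - 1)\,x^2 = 0, i.e. x = (M^3 - M)/(M^3 - 1).
% For M = 2 the intercept is 6/7.
```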

Each point in this space represents a set of class
probabilities, pi(j), i=1,2,..M, and all situations where
the point falls within the ellipsoid represent a case where
the mixture parameter has a variance less than the sum of
the class parameter variances. For a large number of
classes, the intercept on each axis tends to unity, and
hence most of the hypercube in which the probabilities,
pi(j), fall, lies within the ellipsoid, and the previous
statement is true. Hence, one can state that for 'most'
situations, the mixture parameter can be estimated with a
smaller variance than the sum of variances of the
individual class parameters. For the case of two classes,
M=2, the ellipse obtained from (6.4) is depicted in the
figure below. The shaded area shows the region in which the
inequality (6.4) is valid.


Fig. 6.2
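
The region described by (6.4) can be probed numerically by comparing the two variances directly for the equal-prior case, with N mixture samples against N/M samples per class. The function names and the two test points are hypothetical choices made for illustration.

```python
def mixture_variance(x, n_total):
    """Variance of the mixture estimate from N samples of an equal-prior
    mixture, where x[i] = pi(j) for a fixed state j."""
    p = sum(x) / len(x)
    return p * (1 - p) / n_total

def summed_class_variances(x, n_total):
    """Sum of the class estimate variances with N/M samples per class."""
    per_class = n_total / len(x)
    return sum(xi * (1 - xi) / per_class for xi in x)

def inside_ellipsoid(x):
    """Left-hand side of (6.4) is non-negative inside the ellipsoid."""
    m = len(x)
    s1 = sum(x)
    s2 = sum(xi * xi for xi in x)
    cross = s1 * s1 - s2          # sum over j != k of xj * xk
    return (m**3 - m) * s1 + cross - (m**3 - 1) * s2 >= 0

N = 100
inner = (0.5, 0.5)     # well inside the M = 2 ellipse
outer = (0.95, 0.0)    # beyond the axis intercept 6/7
```

For the inner point the mixture variance is the smaller of the two, while for the outer point the ordering reverses, matching the shaded and unshaded regions of the two-class figure.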


6.2.  Mean  Accuracy  Of  A Hierarchical  Classifier 

The mean accuracy of a classifier is defined [33,34] as
the performance of a classifier averaged over the space of
all problems. A problem consists of the true set of
parameters defining the class distributions. In this
discussion, we assume that a hierarchical classifier is
used to distinguish between M classes, using discrete
features, whose probability functions are {p_k^i(j)}, where i
refers to the feature number, k is the class, and j ranges
over the states of the feature. For simplicity, it is
assumed that the parameters, {p_k^i(j)}, are uniformly
distributed, i.e. all problems are equally likely.

For the purposes of this discussion, we assume that a
binary balanced decision tree is used for an M-class
problem, where each class label occurs only at one terminal
node. Hence the number of levels, L, in the tree is,

$$ L = \log_2 M . $$

It is assumed that a single feature is measured at every
nonterminal node in the tree, and all feature values are
quantized into m states (levels). The analysis that follows
can be readily extended to trees which are not binary, and
which use more than one feature at the nodes. However, the
assumption made throughout this discussion is that all
features are class-conditionally statistically independent.

Let Nk be the number of design samples from class wk,
k=1,2,..M, and assume that these are proportional to P(wk),
the class a priori probability. Let {p_k^i(j), j=1,2,..m}
be the class-conditional probabilities for
feature i in class wk. For ease of notation, we also assume
that feature i is used at node li.

Then a maximum likelihood estimate of the parameters,
p_k^i(j), is defined as,

$$ \hat{p}_k^i(j) = n_k^i(j) / N_k $$

where n_k^i(j) is the number of wk samples whose i-th
feature was in state j.

The estimated maximum likelihood decision rule at li for
a sample (x1,x2,..,xm) reaching it is,

If xi is in state j, and,

$$ \sum_{w_k \in W_0(l_i)} n_k^i(j) \ \ge \sum_{w_k \in W_1(l_i)} n_k^i(j) , $$

then go left below li,
else go right.
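
A minimal sketch of this node rule, applied to a table of design-sample counts, is given below. The count table, the class groupings and the function name are hypothetical.

```python
def go_left(counts, left_classes, right_classes, state):
    """Estimated maximum likelihood rule at a node: go left if the
    design-sample counts for the observed state, summed over the classes
    in W0, are at least those summed over the classes in W1."""
    left = sum(counts[k][state] for k in left_classes)
    right = sum(counts[k][state] for k in right_classes)
    return left >= right

# Hypothetical counts n_k(j): rows are classes w1..w4, columns are 3 states.
COUNTS = [
    [6, 3, 1],
    [5, 4, 1],
    [1, 4, 5],
    [2, 2, 6],
]
decisions = [go_left(COUNTS, (0, 1), (2, 3), j) for j in range(3)]
```

Here state 0 and state 1 send a sample left (11 against 3, and 7 against 6), while state 2 sends it right (2 against 11). Ties are resolved in favor of the left branch, as in the rule stated above.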




When the samples are independent, the sampling
distribution of the counts, {n_k^i(j), j=1,2,..m}, is a
multinomial with m-1 degrees of freedom, given by,

$$ \Pr[n_k^i(j) : j \in (1,m)] = (N_k)! \, \frac{\prod_{j=1}^{m} [p_k^i(j)]^{n_k^i(j)}}{\prod_{j=1}^{m} [n_k^i(j)]!} \qquad (6.5) $$


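Therefore, the performance of a classifier T, designed
using a particular design sample set, is,

$$ P_c(T,M / \{p_k^i(j)\}, \{n_k^i(j)\}) = \sum_{k=1}^{M} P(w_k) \prod_{i \in S(w_k)} \sum_{j=1}^{m} \delta_k^i(j)\, p_k^i(j) \qquad (6.6) $$

where,

$$ \delta_k^i(j) = \begin{cases}
1 & \text{if } w_k \in W_0(l_i) \text{ and } \sum_{w_t \in W_0(l_i)} n_t^i(j) \ge \sum_{w_t \in W_1(l_i)} n_t^i(j) , \\
1 & \text{if } w_k \in W_1(l_i) \text{ and } \sum_{w_t \in W_0(l_i)} n_t^i(j) < \sum_{w_t \in W_1(l_i)} n_t^i(j) , \\
0 & \text{otherwise.}
\end{cases} $$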


The average performance of a hierarchical classifier
designed using Nk samples from class wk, k=1,2,..M, can be
obtained by taking the expectation of
Pc(T,M/{p_k^i(j)},{n_k^i(j)})
with respect to the sampling distributions given by
equ. (6.5). This expectation operator will be denoted by E'.
The mean accuracy of a balanced binary tree for an M-class
problem is then derived by averaging the expected
performance over the 'space of all problems', using some
assumed prior distribution of the parameters, {p_k^i(j)}. This
expectation operator is denoted by E. The resulting
expression for the mean accuracy is given by equ. (6.7). The
derivation of the result is described next.

$$ P_{cr}(T,M) = E\Big[ E'\big( P_c(T,M \,|\, \{p_k^i(j)\}, \{n_k^i(j)\}) \big) \Big] \qquad (6.7) $$


Assume that all values of {p_k^i(j), j ∈ (1,m)} are equally
likely, under the constraint,

$$ \sum_{j=1}^{m} p_k^i(j) = 1 $$

Then the probability density, dP[{p_k^i(j), j ∈ (1,m)}], is
given by [33],

$$ dP[\{p_k^i(j),\ j \in (1,m)\}] = (m-1)! \; dp_k^i(1) \cdot dp_k^i(2) \cdots dp_k^i(m) $$


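Denoting by ∫ the integration over the range of problems,
{p_k^i(j), j ∈ (1,m), k ∈ (1,M), i = 1,2,..}, and summing over
all possible design samples, {n_k^i(j)}, Equ.(6.7) becomes,

$$ P_{cr}(T,M) = \sum_{k=1}^{M} P(w_k) \int \sum_{\{n_t^i(j)\}} \Big[ \prod_{i \in S(w_k)} \sum_{j=1}^{m} \delta_k^i(j)\, p_k^i(j) \Big] \prod_{i} \prod_{t=1}^{M} \Pr[n_t^i(j),\ j \in (1,m)] \; \prod_{i} \prod_{t=1}^{M} (m-1)! \; dp_t^i(1) \cdots dp_t^i(m) \qquad (6.8) $$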


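The product terms indexed with node labels (features), i, can be
grouped together and the summation over the states, j, interchanged
with the summation over the counts, to yield,

$$ P_{cr}(T,M) = \sum_{k=1}^{M} P(w_k) \prod_{i \in S(w_k)} \Bigg[ \sum_{j=1}^{m} \int_{\{p_t^i(j),\ j \in (1,m),\ w_t \in W(l_i)\}} p_k^i(j) \sum_{\{n_t^i(j)\} \in U_k^i} \; \prod_{w_t \in W(l_i)} \Pr[n_t^i(j)] \prod_{w_t \in W(l_i)} (m-1)! \; dp_t^i(1) \cdots dp_t^i(m) \Bigg] \qquad (6.9) $$

In the above equation, U_k^i represents the set of counts {n_t^i(j)}
such that,

$$ \sum_{w_t \in W_0(l_i)} n_t^i(j) \ \ge \sum_{w_t \in W_1(l_i)} n_t^i(j) \quad \text{if } w_k \in W_0(l_i) $$

or,

$$ \sum_{w_t \in W_0(l_i)} n_t^i(j) \ < \sum_{w_t \in W_1(l_i)} n_t^i(j) \quad \text{if } w_k \in W_1(l_i) $$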


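From the symmetry of the problem, it follows that each term in the
summation over the states, j, must be the same. Hence, the mean
recognition rate for class wk is m times the mean recognition rate for
a sample whose features are all in state 1 (say). Equ.(6.9) can
therefore be written in the following way, after substituting for the
probability distribution of {n_t^i(1), t ∈ (1,M)}:

$$
\begin{aligned}
P_{cr}(T,M) = \sum_{k=1}^{M} P(w_k) \prod_{i \in S(w_k)} \Bigg[\; & m\,[(m-1)!]^{M_i} \sum_{\{n_t^i(1)\} \in U_k^i}
 \int_{p_k^i(1)=0}^{1} p_k^i(1)\, \frac{(N_k)!}{[n_k^i(1)]!\,[N_k - n_k^i(1)]!}\, [p_k^i(1)]^{n_k^i(1)}\, [1-p_k^i(1)]^{N_k - n_k^i(1)}\, dp_k^i(1) \\
 & \times \int_{p_k^i(2)=0}^{1-p_k^i(1)} dp_k^i(2) \cdots \int_{p_k^i(m)=0}^{1-p_k^i(1)-p_k^i(2)-\cdots-p_k^i(m-1)} dp_k^i(m) \\
 & \times \prod_{\substack{w_t \in W(l_i) \\ w_t \ne w_k}} \int_{p_t^i(1)=0}^{1} \frac{(N_t)!}{[n_t^i(1)]!\,[N_t - n_t^i(1)]!}\, [p_t^i(1)]^{n_t^i(1)}\, [1-p_t^i(1)]^{N_t - n_t^i(1)}\, dp_t^i(1) \\
 & \qquad \times \int_{p_t^i(2)=0}^{1-p_t^i(1)} dp_t^i(2) \cdots \int_{p_t^i(m)=0}^{1-p_t^i(1)-\cdots-p_t^i(m-1)} dp_t^i(m) \Bigg] \qquad (6.10)
\end{aligned}
$$

where M_i = number of classes below node li.

In Equ.(6.10), U_k^i refers to the set of sample counts,
{n_t^i(1), wt ∈ W(li)}, such that,

$$ \sum_{w_t \in W_0(l_i)} n_t^i(1) \ \ge \sum_{w_t \in W_1(l_i)} n_t^i(1) \quad \text{if } w_k \in W_0(l_i) $$

$$ \sum_{w_t \in W_0(l_i)} n_t^i(1) \ < \sum_{w_t \in W_1(l_i)} n_t^i(1) \quad \text{if } w_k \in W_1(l_i) $$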

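Each of the multiple integrals in the last product form in Equ.(6.10)
reduces to a Beta function, viz.

$$ \frac{1}{(m-2)!} \int_{0}^{1} \frac{(N_t)!}{[n_t^i(1)]!\,[N_t - n_t^i(1)]!}\, [p_t^i(1)]^{n_t^i(1)}\, [1-p_t^i(1)]^{N_t - n_t^i(1) + m - 2}\, dp_t^i(1)
 = \frac{1}{(m-2)!} \cdot \frac{(N_t)!}{(N_t+m-1)!} \cdot \frac{(N_t - n_t^i(1) + m - 2)!}{(N_t - n_t^i(1))!} $$

Similarly, the multiple integral for class wk is evaluated as,

$$ \frac{1}{(m-2)!} \cdot \frac{(N_k)!\,(N_k - n_k^i(1) + m - 2)!}{(N_k+m-1)!\,(N_k - n_k^i(1))!} \cdot \frac{n_k^i(1) + 1}{N_k + m} $$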
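
The reduction to a Beta function can be checked with the standard identity for integer exponents. In the sketch below, n and N abbreviate n_t^i(1) and N_t; the remark about the inner integrals is the usual volume of a simplex and is stated here as the assumed source of the factor (1-p)^{m-2}/(m-2)!.

```latex
% Inner integrals over p(2),...,p(m-1), constrained to sum to at most 1 - p(1),
% contribute the simplex volume (1 - p(1))^{m-2} / (m-2)!.
% Beta integral for non-negative integers a, b:
\[
\int_0^1 p^{a} (1-p)^{b}\, dp = \frac{a!\, b!}{(a+b+1)!}
\]
% With a = n and b = N - n + m - 2:
\[
\frac{1}{(m-2)!} \cdot \frac{N!}{n!\,(N-n)!} \cdot \frac{n!\,(N-n+m-2)!}{(N+m-1)!}
 = \frac{1}{(m-2)!} \cdot \frac{N!}{(N+m-1)!} \cdot \frac{(N-n+m-2)!}{(N-n)!}
\]
% For class w_k the integrand carries an extra factor p, so a = n + 1:
\[
\frac{(n+1)!\,(N-n+m-2)!}{(N+m)!}
 = \frac{n!\,(N-n+m-2)!}{(N+m-1)!} \cdot \frac{n+1}{N+m}
\]
% which accounts for the additional factor (n_k^i(1) + 1)/(N_k + m).
```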




Substituting these expressions in Equ.(6.10) yields the final expression
for the mean accuracy of a binary decision tree for an M class problem,
viz.

$$ P_{cr}(T,M) = \sum_{k=1}^{M} P(w_k) \prod_{i \in S(w_k)} \Bigg[ m\,(m-1)^{M_i} \sum_{\{n_t^i(1)\} \in U_k^i} \Bigg( \prod_{w_t \in W(l_i)} \frac{(N_t)!\,(N_t - n_t^i(1) + m - 2)!}{(N_t+m-1)!\,(N_t - n_t^i(1))!} \Bigg) \cdot \frac{n_k^i(1) + 1}{N_k + m} \Bigg] \qquad (6.11) $$
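
For very small cases the sum in (6.11) can be evaluated by direct enumeration of the counts. The sketch below does this for a single-node, two-class tree with equal priors and equal sample sizes, using exact rational arithmetic. It follows the expression as written, including its treatment of ties in favor of the left branch; the function names are hypothetical and m is assumed to be at least 2.

```python
from fractions import Fraction
from itertools import product
from math import factorial

def count_weight(n, n_samples, m):
    """(m-1) * N! (N-n+m-2)! / ((N+m-1)! (N-n)!): the factor contributed
    by one class whose state-1 count is n."""
    return Fraction(
        (m - 1) * factorial(n_samples) * factorial(n_samples - n + m - 2),
        factorial(n_samples + m - 1) * factorial(n_samples - n),
    )

def node_term(k, left, right, n_samples, m):
    """Bracketed factor of (6.11) for class k at a single node."""
    classes = list(left) + list(right)
    total = Fraction(0)
    for counts in product(range(n_samples + 1), repeat=len(classes)):
        n = dict(zip(classes, counts))
        goes_left = sum(n[t] for t in left) >= sum(n[t] for t in right)
        if goes_left != (k in left):
            continue              # counts outside U_k: sample misrouted
        weight = Fraction(1)
        for t in classes:
            weight *= count_weight(n[t], n_samples, m)
        total += weight * Fraction(n[k] + 1, n_samples + m)
    return m * total

def two_class_mean_accuracy(n_samples, m):
    """Mean accuracy of a one-node tree separating class 0 from class 1."""
    left, right = (0,), (1,)
    return sum(Fraction(1, 2) * node_term(k, left, right, n_samples, m)
               for k in (0, 1))

accuracy = two_class_mean_accuracy(n_samples=1, m=2)
```

With one sample per class and two states, the enumeration gives 5/6 for the left class and 1/3 for the right class, i.e. a mean of 7/12; the imbalance between the two classes comes from ties being sent left.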


The above expression for the mean accuracy is too complex for direct
interpretation. However, for small values of sample sizes, Nt, Equ.(6.11)
can be written in a simpler form. In particular, we shall consider the
case where there is only a single sample available per class, i.e.

N1 = N2 = ... = NM = 1

Since each variable n_t^i(1) can only take on values 0 or 1, the multiple
summation over U_k^i in Equ.(6.11) can be replaced by a double summation
in the following manner. If there are Mi classes under node li, then there
are Mi/2 classes (and hence samples) from the sets of classes W0(li) and
W1(li), respectively. Assume that class wk is in the set W0(li), i.e.
it is below the left of the two nodes below li. Let j1 and j2 denote the
number of samples from W0(li) and W1(li), respectively, for which feature
i was in state 1, i.e. n_t^i(1)=1. Then, the rule at li would be,

if j1 ≥ j2 go left
if j1 < j2 go right.

A sample from wk would be correctly classified only if j1 ≥ j2.
Hence the summation within the square brackets in Equ.(6.11) can be
written as

[Equation not legible in the source.]

In the above equation, the first term is the case when n_k^i(1)=1, while
the second term is the case n_k^i(1)=0. Upon simplification, it yields

[Equation (6.12) not legible in the source.]

Substitution of Equ.(6.12) into Equ.(6.11) will not yield the correct
value for the mean accuracy, because the case j1 = j2 has been treated
unsymmetrically in (6.12). Hence, to equalize the correct recognition
rates for classes on either side of li, we assume the summation over j2
to be from 0 to Mi/2 as for j1, and then put a weighting factor of 0.5
for the entire

Experimental  Results 


The table below shows the variation of the mean
accuracy, $P_{cr}(T,M)$, with sample size N, quantization
complexity m, and the number of classes M. For any given
sample size, it is seen that there is an optimal
quantization complexity. This optimal complexity increases
with increasing sample size. Moreover, for a given sample
size, the optimal complexity for an M-class problem is
greater than or equal to the optimal complexity of an
M'-class problem, if M > M'. Thus, for the case N=5, the
optimal complexity for a 2-class problem is 3, while for the
4-class case, it is 4. The maximum values of performance in
each case are underlined in the table. Graph 2 shows the
variation of tree performance with the complexity for
various sample sizes N and number of classes M.


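[Graph 2: tree performance versus quantization complexity for various sample sizes N and numbers of classes M]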




6.3. Hierarchical Classifier Versus One-Step Classifier

This section compares the performances of a hierarchical
classifier and a one-step classifier for a particular
problem, viz., a given set of parameters, $\{p_i^k(j)\}$. Though it
is difficult to use the analysis presented here to make any
judgements about all M-class problems, one can make the
following qualitative statement: if the sample size is small
and the error rate of a one-step scheme is only slightly
better than that of a hierarchical scheme on an independent
test set, then it is probably better to use the hierarchical
scheme. This result is a consequence of the fact that in a
decision tree, each node decision is simpler than the M-way
decision in a one-step method. Hence, the former requires
fewer parameters to be estimated (discrete features assumed
here). The result is counterintuitive since in theory (i.e.,
assuming perfect knowledge of all parameters), no scheme can
do better than a one-step classification rule (such as a
maximum likelihood rule) which uses ALL features to reach a
decision.

Consider the two alternative classification schemes
shown in Fig. 6.3a-b below, for a 4-class problem. Two
features (discrete) are available and assumed to be
class-conditionally statistically independent. Since the
one-step scheme in Fig. 6.3b uses both f1 and f2 to make the
decision, while the tree uses only a single feature at every
node, one would expect the former to perform better. After
deriving expressions for the performances of these methods,
we illustrate by example that, for small sample sizes, the
tree may have a lower error rate.




From the analysis of the preceding section, the
performance of a decision tree averaged across the sampling
distribution is given by,

$$E'\left[\,P_c(T,M/\{p_i^k(j)\},\{n_i^k(j)\})\,\right]$$

where E' denotes the expectation operator over the
distribution of the counts, $\{n_i^k(j),\ j \in (1,m),\ k = 1,2,3,4,\ i = 1,2\}$.
Note that we do not average over the parameters, $\{p_i^k(j)\}$,
since we are considering a particular problem. With
this modification, and an identical analysis as in the last
section, we obtain,

$$P_c(T,M/\{p_i^k(j)\})=\sum_{k=1}^{M}P(w_k)\prod_{i\in S(w_k)}\left[\sum_{j=1}^{m}p_i^k(j)\;{\sum}^{*}\prod_{t\in W(\ell_i)}\Pr[n_i^t(j)]\right] \qquad (6.14)$$

where $S(w_k)$ are the nodes $\ell_i$ on the path to class $w_k$,
$\sum^{*}$ denotes summation over the set of counts $\{n_i^t(j)\}$ such that,

$$\sum_{t\in W_0(\ell_i)} n_i^t(j) \;\ge \sum_{t\in W_1(\ell_i)} n_i^t(j) \quad \text{if } w_k \in W_0(\ell_i),$$

$$\sum_{t\in W_0(\ell_i)} n_i^t(j) \;< \sum_{t\in W_1(\ell_i)} n_i^t(j) \quad \text{if } w_k \in W_1(\ell_i),$$

and $\Pr[n_i^t(j)]$ denotes the marginal probabilities of the
counts $n_i^t(j)$, which is a binomial distribution given by

$$\Pr[n_i^t(j)]=\frac{N_t!}{[n_i^t(j)]!\,[N_t-n_i^t(j)]!}\,[p_i^t(j)]^{n_i^t(j)}\,[1-p_i^t(j)]^{N_t-n_i^t(j)}$$


For the one-step rule, the decision is to classify a
sample X whose components are $x_1=j_1$ and $x_2=j_2$ into class
$w_k$ if,

$$n_1^k(j_1)\cdot n_2^k(j_2) = \max_t\,\{\,n_1^t(j_1)\cdot n_2^t(j_2)\,\},$$

and decide ties arbitrarily.

From the above rule, and from a generalization of the
case where a single feature is used (such as Equ. 6.6), it
follows that the performance of the one-step rule averaged
over the sampling distribution is given by,

$$P_c(OS,M/\{p_i^k(j)\})=\sum_{k=1}^{M}P(w_k)\left[\sum_{j_1}\sum_{j_2}\sum_{\{n_i^t\}}\delta_k(j_1,j_2)\;p_1^k(j_1)\,p_2^k(j_2)\prod_{t=1}^{M}\Pr[n_1^t(j_1)]\,\Pr[n_2^t(j_2)]\right] \qquad (6.15)$$

where $\delta_k(j_1,j_2) = 1$ if,

$$n_1^k(j_1)\cdot n_2^k(j_2) = \max_t\,\{\,n_1^t(j_1)\cdot n_2^t(j_2)\,\}$$

and $\delta_k(j_1,j_2) = 0$ otherwise.
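The two averaged performances can be evaluated by brute-force enumeration of the design counts for a small problem. The sketch below is an illustration under stated assumptions, not the computation used for the results reported here: the tree layout (root on feature 1 separating classes {1,2} from {3,4}, both children on feature 2), the even splitting of ties in the one-step rule, the equal sample size N for every class, the parameter layout `p[feature][class][state]`, and all function names are hypothetical choices.

```python
from itertools import product
from math import comb


def binom_pmf(n, N, p):
    """Binomial probability of observing count n in N design samples."""
    return comb(N, n) * p**n * (1 - p) ** (N - n)


def prob_go_left(feature, left, right, j, p, N):
    """Probability, over the design sample, that the node rule sends state j left.

    The node goes left when the summed counts of state j over the left classes
    are at least the summed counts over the right classes.
    """
    classes = left + right
    total = 0.0
    for counts in product(range(N + 1), repeat=len(classes)):
        weight = 1.0
        for t, n in zip(classes, counts):
            weight *= binom_pmf(n, N, p[feature][t][j])
        if sum(counts[: len(left)]) >= sum(counts[len(left):]):
            total += weight
    return total


def tree_accuracy(p, N, priors):
    """Expected accuracy of the assumed two-level binary tree for 4 classes."""
    m = len(p[0][0])
    root = (0, (0, 1), (2, 3))
    accuracy = 0.0
    for k, prior in enumerate(priors):
        child = (1, (0,), (1,)) if k < 2 else (1, (2,), (3,))
        path_prob = 1.0
        for feature, left, right in (root, child):
            node_prob = 0.0
            for j in range(m):
                go_left = prob_go_left(feature, left, right, j, p, N)
                node_prob += p[feature][k][j] * (go_left if k in left else 1 - go_left)
            path_prob *= node_prob
        accuracy += prior * path_prob
    return accuracy


def one_step_accuracy(p, N, priors):
    """Expected accuracy of the one-step rule; ties are split evenly (an assumption)."""
    m, M = len(p[0][0]), len(priors)
    accuracy = 0.0
    for j1, j2 in product(range(m), repeat=2):
        for counts in product(range(N + 1), repeat=2 * M):
            n1, n2 = counts[:M], counts[M:]
            weight = 1.0
            for t in range(M):
                weight *= binom_pmf(n1[t], N, p[0][t][j1]) * binom_pmf(n2[t], N, p[1][t][j2])
            if weight == 0.0:
                continue
            scores = [n1[t] * n2[t] for t in range(M)]
            winners = [t for t in range(M) if scores[t] == max(scores)]
            for k in winners:
                accuracy += priors[k] * p[0][k][j1] * p[1][k][j2] * weight / len(winners)
    return accuracy


if __name__ == "__main__":
    # Hypothetical parameters: 2 features, 4 classes, 2 states per feature.
    p = [
        [[0.8, 0.2], [0.7, 0.3], [0.3, 0.7], [0.2, 0.8]],
        [[0.8, 0.2], [0.2, 0.8], [0.7, 0.3], [0.3, 0.7]],
    ]
    priors = [0.25] * 4
    for N in (1, 2, 3):
        print(N, 1 - tree_accuracy(p, N, priors), 1 - one_step_accuracy(p, N, priors))
```

The enumeration grows as (N+1)^8 for the one-step rule, so it is only practical for the small sample sizes that are of interest in this comparison.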


Experimental  Results: 


Equations (6.14) and (6.15) were used to determine the
performances of the tree and one-shot schemes as a function
of sample size. The table below shows the error rate for the
various cases considered. In the first example, the tree
does better than the one-shot method for all sample sizes,
though as sample size becomes large, the latter begins to
catch up. Example 2 shows the situation as sample size
becomes large. In this case, for sample size less than 4,
the tree does better, though as sample size increases, the
one-shot does better. For infinite sample size, one would
expect the one-shot to be consistently better in all cases.
Graph 3 depicts the variation of the error rate with sample
size for both these examples.


[Graph 3: error rate versus sample size for the tree classifier and the one-step classifier, Example 1 and Example 2]

Example      Sample   Ideal OS   Ideal Tree   Estimated   Estimated
             size     error      error        OS error    Tree error

Example 1:   2        .57        .5775        .7621       .6336
             4        .57        .5775        .6784       .6171
             8        .57        .5775        .6226       .6088
             --       .57        .5775        .6080       .6053

Example 2:   2        .5625      .6375        .7563       .7171
             4        .5625      .6375        .6712       .7015
             8        .5625      .6375        .6201       .6823
             12       .5625      .6375        .6020       .6706

This chapter has investigated the relationship between
complexity, sample size, and performance. Complexity can be
regarded as composed of measurement or quantization
complexity, and decision complexity. We have shown that for
a given sample size, there exists an optimal quantization
complexity which increases with sample size and the number
of classes, M. We have also shown that in certain cases, a
scheme with a lower degree of decision complexity (i.e. one
which distinguishes between fewer classes at each stage)
can perform better than one with a greater decision
complexity (such as a one-step classifier), when the sample
size is small.




7. Main Results and Directions for Further Research

This study has led to the following main results regarding the
analysis and design of multistage multiclass pattern recognition
schemes:

(1) Most parametric and non-parametric multistage pattern
classification schemes used in practice can be described by the
theoretical model analyzed in Chapter 2. Certain concepts of
admissibility and optimality developed in heuristic programming
have been extended in this work to multistage statistical pattern
classification schemes which trade measurement cost for
misclassification risk. The conjecture by some earlier
researchers that there may not exist Bayes-admissible search
strategies which do not measure all features is disproved in our
study.

(2) The two types of search strategies derived for this
general model require informed heuristics to improve their search
efficiency and to test for termination conditions. We have
investigated new methods of obtaining lower and upper bounds on
the misclassification risk, given a subset of observations on the
test sample. Such methods have been used to derive bounds for
discrete features which are conditionally statistically
independent, or satisfy a first-order tree dependence. The
feasibility of computing bounds on the Euclidean distance or
similarity measures for binary vectors establishes the efficacy
of our model in implementing nearest neighbour classification
schemes.

Similar 'bounding' techniques can be used for more elaborate
forms of feature dependence, such as second-order and higher
order dependence models. Our analyses assumed that the features
were discrete and non-metric. Much work remains to be done in
determining whether bounds on the risk can be derived for the case
of continuous valued features and used in formulating admissible
strategies.

(3) We have shown that hierarchical classifiers are a special
case of the general state-space model. Our analysis has focused on
decision trees whose node decisions are statistically independent.
We have proved that even under the independence assumption,
optimizing the performance of each node classifier individually
does not optimize the overall tree performance.

(4)  A new  systematic  approach  to  optimal  tree  design  has  been 
presented  in  this  work.  This  method  consists  of  decomposing  the 
design  problem  into  three  phases  viz.,  tree  skeleton  design, 
feature  assignment  to  its  nodes  and  node  decision  policy  design. 
Optimal  solutions  to  each  design  phase  are  obtained  by  using  the 
recursive  formulation  of  dynamic  programming. 

(5) When the features are discrete and non-metric, the design
of the optimal decision policy at each node requires excessive
computational resources. Methods of clustering decision rules and
efficient techniques of discarding suboptimal sets of rules have
been developed in Chapter 5. Further investigation of such methods
is needed for design problems wherein the node decisions are
interdependent.

(6) Feature ranking and a method of bounding the performance
of a partially designed tree have been proposed in this study as
means of reducing the complexity of the feature assignment
problem. Improved bounds using parametric models of feature
distributions might aid greatly in reducing the execution time of
branch-and-bound algorithms such as the one presented in Chapter
5.



(7) The analysis regarding the effects of finite sample size on
the performance of hierarchical classifiers has led to two new
results:

(a) For a fixed sample size, there exists an optimal
quantization complexity for each feature used in a decision tree
used for discriminating M classes. The optimal complexity
increases with increasing sample size and the number of classes to
be distinguished. Our analysis assumed that all features were
quantized into the same number of levels. Based on the observed
relationship between the optimum complexity and the number of
classes, it is our conjecture that a feature used higher up in a
tree should be quantized into more levels than one used further
away from the root. An analysis similar to that given in Chapter
6 might validate this hypothesis.

(b) For small sample sizes, we have shown analytically that
certain multiclass problems are better solved using a hierarchical
classifier than a one-step classifier, even though, in theory
(i.e. given perfect knowledge of the probability functions), the
one-step method would perform better in all cases. We have, as
yet, no way to characterize problems for which this "small sample
phenomenon" occurs. One would like to have a general set of
guidelines on the most effective way of using a finite sample to
design a hierarchical classifier.

This  dissertation  provides  a broad  spectrum  of  admissible 
strategies  for  multistage  multiclass  recognition  problems,  and 
offers  a set  of  optimization  procedures  for  the  automated  design 
of  optimal  hierarchical  classifiers. 




UNCLASSIFIED
SECURITY CLASSIFICATION OF THIS PAGE (When Data Entered)

REPORT DOCUMENTATION PAGE
(READ INSTRUCTIONS BEFORE COMPLETING FORM)

1. REPORT NUMBER: AFOSR-TR-77-0825
2. GOVT ACCESSION NO.:
3. RECIPIENT'S CATALOG NUMBER:
4. TITLE (and Subtitle): OPTIMAL AND HEURISTIC SYNTHESIS OF
   HIERARCHICAL CLASSIFIERS
5. TYPE OF REPORT & PERIOD COVERED: Technical report
6. PERFORMING ORG. REPORT NUMBER:
7. AUTHOR(s): Ashok V. Kulkarni
8. CONTRACT OR GRANT NUMBER(s): ENG 73-04099 & AFOSR 76-2901
9. PERFORMING ORGANIZATION NAME AND ADDRESS:
   Department of Computer Science
   University of Maryland
   College Park, Maryland 20742
10. PROGRAM ELEMENT, PROJECT, TASK AREA & WORK UNIT NUMBERS:
11. CONTROLLING OFFICE NAME AND ADDRESS:
    National Science Foundation, Engineering Division
    1800 G Street, N.W.
    Washington, D.C. 20550
12. REPORT DATE: August 1976
13. NUMBER OF PAGES: 174
14. MONITORING AGENCY NAME & ADDRESS (if different from Controlling Office):
15. SECURITY CLASS. (of this report):
15a. DECLASSIFICATION/DOWNGRADING SCHEDULE:
16. DISTRIBUTION STATEMENT (of this Report):
    Distribution of this document is unlimited.
17. DISTRIBUTION STATEMENT (of the abstract entered in Block 20, if
    different from Report):
18. SUPPLEMENTARY NOTES:
    Controlling Office Name & Address (as in #11) also:
    Air Force Office of Scientific Research
    Bolling AFB,
    Washington, D.C. 20332
19. KEY WORDS (Continue on reverse side if necessary and identify by
    block number):
20. ABSTRACT (Continue on reverse side if necessary and identify by
    block number):

Multistage schemes such as hierarchical classifiers have been found useful
for many multiclass pattern recognition tasks. This dissertation investigates
the theoretical properties of a general model of multistage multiclass
recognition schemes. The generality of the model allows one to describe a
large class of parametric and non-parametric schemes commonly used in terms
of the model parameters. Two classes of admissible and optimal strategies
for obtaining the optimal decision are analyzed. These strategies employ
lower and upper bounds on a risk function to improve the search efficiency.
New methods of computing the bounds are




REPORT DOCUMENTATION PAGE
(READ INSTRUCTIONS BEFORE COMPLETING FORM)

1. REPORT NUMBER: AFOSR-TR-77-0825
2. GOVT ACCESSION NO.:
3. RECIPIENT'S CATALOG NUMBER:
4. TITLE (and Subtitle): OPTIMAL AND HEURISTIC SYNTHESIS OF
   HIERARCHICAL CLASSIFIERS
5. TYPE OF REPORT & PERIOD COVERED: Interim
6. PERFORMING ORG. REPORT NUMBER:
7. AUTHOR(s): Ashok V. Kulkarni
8. CONTRACT OR GRANT NUMBER(s): AFOSR 76-2901
9. PERFORMING ORGANIZATION NAME AND ADDRESS:
   Department of Computer Science
   University of Maryland
   College Park, Maryland 20742
10. PROGRAM ELEMENT, PROJECT, TASK AREA & WORK UNIT NUMBERS:
    61102F
    2304/A2
11. CONTROLLING OFFICE NAME AND ADDRESS:
    Air Force Office of Scientific Research/NM
    Bolling AFB DC 20332
12. REPORT DATE: August 1976
13. NUMBER OF PAGES: 174
14. MONITORING AGENCY NAME & ADDRESS (if different from Controlling Office):
15. SECURITY CLASS. (of this report): UNCLASSIFIED
15a. DECLASSIFICATION/DOWNGRADING SCHEDULE:
16. DISTRIBUTION STATEMENT (of this Report):
    Approved for public release; distribution unlimited.
17. DISTRIBUTION STATEMENT (of the abstract entered in Block 20, if
    different from Report):
18. SUPPLEMENTARY NOTES:
19. KEY WORDS (Continue on reverse side if necessary and identify by
    block number):
20. ABSTRACT (Continue on reverse side if necessary and identify by
    block number):

Multistage schemes such as hierarchical classifiers have been found useful
for many multiclass pattern recognition tasks. This dissertation investigates
the theoretical properties of a general model of multistage multiclass
recognition schemes. The generality of the model allows one to describe a
large class of parametric and non-parametric schemes commonly used in terms
of the model parameters. Two classes of admissible and optimal strategies
for obtaining the optimal decision are analyzed. These strategies employ
lower and upper bounds on a risk function to improve the search efficiency.
New methods of computing the bounds are




20.  Abstract  (Concluded) 

investigated for the cases when the features are class-conditionally
statistically independent and where they satisfy a first-order tree
dependence relation. Bounds are also derived for use in nearest-neighbor
classification schemes employing a Euclidean distance measure and various
similarity measures for non-metric feature vectors.

Hierarchical classifiers are special types of multistage recognition
schemes wherein at each stage certain classes are rejected from consideration
as labels of the test sample. Theoretical properties of decision trees whose
node decisions are statistically independent are investigated. Even under
this independence assumption the optimal tree design task is a complex one.

A three phase decomposition of the tree design problem is proposed, viz.,
tree skeleton design, feature selection at its nodes and decision function
design at each node. Optimal solutions to each design phase are obtained
using a dynamic programming formulation.

These optimal design methods rapidly become cumbersome in computational
resources as the number of features and classes increases. This study
proposes various techniques for reducing the computational complexity
incurred in finding the optimal features to be measured at each node and the
optimal decision policy. A method of clustering decision rules and rejecting
sets of suboptimal rules without evaluating each individual one is proposed.
Feature ranking and a branch-and-bound method are described for reducing the
possible feature assignments to be considered in finding the optimal feature
measurement policy.

In practice, the decision rules at the nodes have to be estimated from a
finite set of design samples. This work investigates the relationship between
the expected tree performance, sample size and the number of states
(quantization levels) of each feature. It is shown that for an M-class
recognition scheme using a decision tree, there exists an optimal
quantization complexity. The optimum complexity increases with sample size
and with the number of classes to be distinguished. For small sample sizes,
it is shown that a multistage decision scheme can have a lower error rate
than a single stage scheme which uses all the available measurements in an
M-way decision rule.




